The multinomial language model has been one of the most effective models of retrieval for over a decade.
However, the multinomial distribution does not model one important linguistic phenomenon relating to
term-dependency, that is the tendency of a term to repeat itself within a document (i.e. word burstiness).
In this article, we model document generation as a random process with reinforcement (a multivariate
Polya process) and develop a Dirichlet compound multinomial language model that captures word burstiness
directly.

We show that the new reinforced language model can be computed as efficiently as current retrieval
models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms
the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show
that the tuning parameter in the proposed model is more robust than in the multinomial language model.
Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model
adheres to the constraint. Finally, we show that the new language model essentially introduces a measure
closely related to idf which gives theoretical justification for combining the term and document event spaces
in tf-idf type schemes.
1. INTRODUCTION

Language modelling approaches to information retrieval have become increasingly
popular since the original works [Ponte and Croft 1998; Hiemstra 1998; 2001;
Lavrenko and Croft 2001; Zhai and Lafferty 2001a]. They afford a particularly
appealing view of the retrieval problem due in part to the principled nature in which a
retrieval function can be mathematically derived. The query likelihood method [Ponte
and Croft 1998] is one of the most widely-adopted approaches to retrieval, and ranks
documents based on the likelihood of their document language model generating the
query string. The most widely-accepted multinomial language model treats the
document model as a multinomial distribution over the terms, where the parameters of
each document model are estimated using the observations from the actual document
smoothed with the entire collection using the Dirichlet prior smoothing method [Zhai
and Lafferty 2001a].

One main deficiency with using a multinomial distribution as a language model is
that all term occurrences are treated independently. The term-independence assumption
in information retrieval is often adopted in theory and practice as it renders the
retrieval problem tractable, simplifies the implementation of many models, and has
[...] term-dependencies [Metzler and Croft 2005; Zhao and Yun 2009; Lv and Zhai
2009a; Cummins and O'Riordan 2009; Bendersky and Croft 2012] have been shown
[...] in general. [...] language modelling approach that has the same complexity as a
unigram language model but also incorporates dependencies, would be a useful
contribution as it would likely exhibit increased effectiveness at no extra computational
cost. In fact, the use of the multinomial distribution in the standard language modelling
approach ignores two types of dependencies; namely, the dependency between distinct
terms^1 (word types) and the dependency between recurrences of the same term (word
tokens). It is this second type of dependency that we address in this article.

It is well known that once a term occurs in a document, it is more likely to re-appear
in the same document. This phenomenon is known as word burstiness [Church and
Gale 1995; Madsen et al. 2005], and is a type of dependency that is not modelled in
the multinomial language model [Zhai and Lafferty 2004]. Essentially, word burstiness
can be defined as the tendency of an otherwise rare term to occur multiple times
in a document, and can be seen as a form of preferential attachment [Simon 1955;
Mitzenmacher 2003]. One theory for this phenomenon is that an author tends to sample
terms previously written in the same document to form association [Simon 1955].
The process of association of similar concepts throughout a document using the same
lexical form may aid coherence, readability, and understanding. For example, if an
author starts to use the term pavement in an article, he/she intuitively tends to continue
its usage throughout the document, rather than changing to one of its synonyms (e.g.
sidewalk or footpath).

On the other hand, queries are requests for information and are generated with a
different motive in mind. When requesting or searching for information a user is more
likely to expand the vocabulary used in the query (and possibly make use of synonyms)
in the hope of matching those query-terms contained in relevant documents.
Furthermore, queries are usually much shorter than documents and as a result, we assume
that queries are less likely to exhibit word-burstiness. That is not to say that a certain
term could not appear multiple times in a query; it simply suggests that the reason for
it reappearing is different from that in a document. For these reasons we model documents
and queries using different generative assumptions.

This article presents the SPUD (Smoothed Polya Urn Document) language model
that incorporates word burstiness only into the document model. We use the Dirichlet
compound multinomial (DCM), also known as the multivariate Polya distribution,
to model documents in place of the standard multinomial distribution, while we use
the standard multinomial to model query generation. We show that this new retrieval
model obtains significantly increased effectiveness compared to the current
state-of-the-art model on a range of datasets for a number of effectiveness metrics. This article
is organized as follows. Section 2 introduces notation used in the remainder of the
article and also presents a comprehensive review of relevant research. Section 3 reviews
the standard language modelling approach. Section 4 presents the SPUD language
model. Section 5 outlines efficient forms of the new retrieval functions, and provides
deep insights into the proposed functions. The experimental design and results are
presented in Section 6. Section 7 presents a discussion of the results and Section 8
concludes with a summary.

^1 This is the traditional term-independence assumption.

2. RELATED RESEARCH 

In this section we review related work in language models and word burstiness, before 
outlining the main contributions of this work. Table I introduces notation used in the
remainder of this article. 


Table I. Feature Notation

Key          Description
c(t, d)      frequency of term t in document d
c(t, q)      frequency of term t in query q
|d|          length of document d (i.e. number of word tokens)
|\vec{d}|    length of document vector (number of distinct terms in document d)
cf_t         collection frequency (frequency of t in the entire collection)
df_t         document frequency (number of documents in which t occurs)
|q|          length of query q (i.e. number of word tokens)
|c|          number of tokens in the entire collection c
n            number of documents in the collection
|v|          vocabulary of the collection (number of distinct terms in the collection)


2.1. Query Likelihood 

The predominant method of ranking documents using the language modelling
approach remains the query likelihood method of Ponte and Croft [1998]. In the query
likelihood method, documents are ranked based on the likelihood of their document
model, $M_d$, generating the query string. The following equation shows how the query
likelihood, $p(q|M_d)$, is calculated for a unigram multinomial language model:

    $p(q|M_d = \theta_{dm}) = \prod_{t \in q} p(t|\theta_{dm})^{c(t,q)}$    (1)

where q is the query string and $\theta_{dm}$ is the multinomial document language model.
The effectiveness of this retrieval method crucially depends on the estimation of the
document model $\theta_{dm}$. It is typically estimated using the actual document d and is
smoothed with the background language model which is estimated from the entire
collection c. When using a multinomial, the query likelihood method (Eq. 1) can be
rewritten in a rank equivalent form as follows:

    $\log p(q|M_d = \theta_{dm}) = \sum_{t \in q} (\log p(t|\theta_{dm}) \cdot c(t,q))$    (2)

which shows that, as with most other retrieval functions (e.g. BM25 [Robertson et al.
1994]), the scoring function comprises a summation of query-term weights. If
$p(t|\theta_{dm})$ is estimated using only the maximum-likelihood estimates of a term
occurring in a document (i.e. $c(t,d)/|d|$), over-fitting would occur. For instance, this
would result in any document that did not contain all query-terms not being retrieved,
as its document model is deemed to have generated the query with a probability of zero
(see Eq. 1). It should also be noted that when substituting the maximum likelihood
probabilities ($c(t,d)/|d|$) into Eq. 2, the weight of each term becomes
$\log(c(t,d)/|d|)$, which has the effect of reducing the weight contribution of successive
occurrences of the same term to a document score. This non-linear term-frequency effect
has often been reported as a useful heuristic in IR [Fang et al. 2004; Fang and Zhai
2005; Cummins and O'Riordan 2007; Clinchant and Gaussier 2010; 2011; Lv and Zhai
2012]. However, in the multinomial query likelihood retrieval method this non-linearity
is only the consequence of a mathematical transformation, and the actual dependency
between successive occurrences of the same term is not modelled.^2
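
The zero-probability problem described above can be illustrated with a minimal Python sketch of Eq. 2 under the unsmoothed maximum likelihood estimate. The function name, variable names, and toy document are illustrative assumptions and do not come from the original work.

```python
import math
from collections import Counter

def mle_query_log_likelihood(query_terms, doc_terms):
    """Eq. 2 with p(t|theta_dm) = c(t, d) / |d|, i.e. no smoothing."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term, qtf in Counter(query_terms).items():
        p = doc_counts[term] / doc_len
        if p == 0.0:
            # A single unseen query term makes the whole likelihood zero.
            return float("-inf")
        score += qtf * math.log(p)
    return score

# Toy document (illustrative data only).
doc = "the urn contains red balls and the urn contains blue balls".split()
print(mle_query_log_likelihood(["urn", "balls"], doc))
print(mle_query_log_likelihood(["urn", "green"], doc))
```

The second query contains a term that is absent from the document, so the document would never be retrieved for it, which is the motivation for the smoothing methods reviewed in Section 3.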

2.2. Advances in Language Models 


Since the initial work applying language models [Ponte and Croft 1998; Hiemstra
1998] to information retrieval, there have been a number of advances in terms of both
theory and practice. Graph-based models [Gao et al. 2004; Metzler and Croft 2005;
Blanco and Lioma 2012; Bendersky and Croft 2012] that capture aspects of
term-dependency have been shown to improve retrieval performance over unigram
models. Furthermore, positional-based language models [Zhao and Yun 2009; Lv and Zhai
2010] have been proposed and incorporate term dependencies that often span several
terms. In general, the incorporation of term-dependency information in larger web
collections has been shown to be beneficial to retrieval quality.

Although many language modelling approaches to information retrieval use the
query-likelihood approach to ranking, it is not the only means of inducing a ranking
using language models. In particular, relevance-based language models [Lavrenko and
Croft 2001] estimate a relevance model from which all relevant documents for a
particular information need are assumed to have been drawn. The approach to ranking in
that work is similar to the classic probabilistic document retrieval approaches
[Sparck-Jones et al. 2000], where documents are ranked based on the odds of being drawn from
the relevant class compared to the non-relevant class. The relevance-based language
modelling approach provides a principled mechanism in which the retrieval model can be
updated as relevant and non-relevant documents become known. This approach led to
the development of pseudo-relevance query expansion language models [Abdul-Jaleel
et al. 2004; Diaz and Metzler 2006; Lv and Zhai 2010].

The language modelling approach has now become a starting point from which more
complex models can be built. Aside from pseudo-relevance query expansion, other
approaches such as latent Dirichlet allocation (LDA) have been incorporated into ad-hoc
retrieval [Wei and Croft 2006]. In essence, improving the retrieval effectiveness of the
standard language modelling approach to information retrieval can ultimately benefit
any of the myriad of approaches which depend upon it (e.g. pseudo-relevance feedback).

2.3. Word Burstiness 

The modelling of word burstiness in documents has been addressed before in
text-related tasks, but it has not been incorporated with the query likelihood method in
information retrieval. Madsen et al. [2005] use the DCM distribution to model word
burstiness and demonstrate its effectiveness on document classification. They estimate
a DCM model for each class from training data. They then classify an unseen document
to a specific class according to the most likely generative DCM class model. They
show that this DCM model outperforms the more standard multinomial model. The
information retrieval task is somewhat different as it deals with both documents and
queries. In our work we have different generative assumptions for both documents
and queries. A further difference is that in the classification task there are a number
of documents from which we can infer a particular class model, while in the
query-likelihood approach to information retrieval we have access to only one instance of a
document from the document model.

^2 This is an important point as some previous work tends to suggest that a non-linear term-frequency factor
in a linear combination of term-weights can capture some aspect of dependency or burstiness. We contend
that this is not the case for language models that use a multinomial as their basis.

Due to the complexity of estimating parameters for the DCM, Elkan [2006]
developed an approximate distribution (the EDCM) and demonstrated its effectiveness for
clustering. We make use of this approximation later in this article. The DCM has also
been used in a hypergeometric language model [Tsagkias et al. 2011] for modelling the
characteristics of very long queries. In other work, a two-stage language modelling
approach has been developed [Goldwater et al. 2011] that generates words according to
the power-law characteristics of natural language. They decompose the language
generation process into a generator, which creates instances of word types, and an adaptor
which has the tendency to repeat those specific word types. Further arguments which
link preferential attachment to the power-law characteristics of natural language are
reviewed by Mitzenmacher [2003]. Cowans [2004] uses a hierarchical Dirichlet process
to arrive at a ranking function which is reported as being superior to BM25. Related
work [Sunehag 2007] provides some interesting connections between the traditional
tf-idf weighting scheme and the two-stage generator-adaptor models. Our work is more
extensive and actually develops a document language model from which retrieval
functions are derived.


In recent work, an extension of earlier information-based approaches [Amati and
Van Rijsbergen 2002] is developed that incorporates burstiness in a log-logistic
retrieval function [Clinchant and Gaussier 2009; 2011]. The authors develop a means for
identifying if a term-frequency distribution is bursty. They conclude that the frequency
distribution must be a type of power-law (or Pareto-type) distribution. Our work is
much more in the spirit of generative language modelling where the term-frequency
aspect occurs naturally from the model (in our case a hierarchical Bayesian approach)
to introduce dependencies between subsequent occurrences of the same term. Our
model also exhibits power-law characteristics consistent with the work by Clinchant
and Gaussier [2011].

The work most similar to ours uses the DCM distribution to develop a probabilistic
relevance-based language model [Xu and Akella 2008; 2010]. For each query they
estimate a relevant and non-relevant DCM model and it is assumed that all documents
are generated from either of those two models. However, our work does not assume a
relevance model and instead, assumes that each document is generated from a
different document model. This means that we model burstiness on a per document basis,
rather than modelling burstiness for a set of relevant (and non-relevant) documents.
It is more likely that different documents are bursty to different degrees as they were
written by different authors, and this is not modelled in the relevance-based approach
of Xu and Akella. Our model is a query-likelihood approach using different generative
assumptions for both the document and query, and leads to retrieval functions that are
distinct from those in the aforementioned relevance-based approach.

Although Xu and Akella [2008] report some improvement in retrieval effectiveness
over the multinomial query likelihood retrieval method on some test collections, their
experiments were restricted to relatively small collections (less than a million
documents) and used only short keyword queries. It is unclear if their results extend to
a more general retrieval scenario. We perform a more robust analysis by using their
best approach (DCM-L-T) as one of our main baselines on a variety of different query
lengths and collection sizes. We also discuss the difference between our approach and
the relevance-based DCM approach of Xu and Akella in our discussion section
(Section 7.1).


2.4. Contributions 

To our knowledge no existing work has developed a document language model for
information retrieval using the generative assumptions outlined in this work. Therefore,
the main contributions of this article are as follows: 

• We propose a new family of document language models that capture word burstiness 
in a probabilistic manner. 

• We develop closed-form expressions for the retrieval functions derived from the new 
language model, and show that our retrieval functions are as efficient as traditional 
bag-of-words retrieval functions. 

• We show that the proposed language model implements several important retrieval 
heuristics not captured in the multinomial language model, such as modelling the 
scope hypothesis and the verbosity hypothesis separately. 

• We show that the modelling of word burstiness in the new language model leads 
to significant improvement in retrieval effectiveness for ad hoc retrieval and for 
downstream methods such as pseudo-relevance feedback. 

We now briefly review the query likelihood retrieval method and the multinomial 
language model. 

3. MULTINOMIAL LANGUAGE MODEL 

In this section we review details of the multinomial query likelihood model and some 
useful approaches to smoothing. 


3.1. Document and Background Models 

As outlined earlier, it is the selection of the generative model and the subsequent
estimation of the document language model that is crucial to retrieval effectiveness using
the query likelihood retrieval method. It has been shown [Zhai and Lafferty 2001a;
2004] that effective estimates of the probability of term occurrences for the
multinomial document language model $\theta_{dm}$ can be found as follows:

    $p(t|\theta_{dm}) = (1 - \pi) \cdot p(t|\theta_d) + \pi \cdot p(t|\theta_c)$    (3)

where $\theta_{dm}$ is the estimated smoothed document language model and $\pi$ is a smoothing
parameter which controls the amount of probability mass that should be redistributed
from the background multinomial $p(t|\theta_c)$ to the document multinomial $p(t|\theta_d)$. This
prevents over-fitting of the document model because in most retrieval formulations
both $p(t|\theta_d)$ and $p(t|\theta_c)$ are estimated using maximum likelihood estimates $c(t,d)/|d|$
and $cf_t/|c|$ respectively. The background multinomial is estimated using all documents
in the entire collection and therefore all tokens in the corpus are treated as
independent observations. The background model can be viewed as the most likely
single model to have generated all of the documents. It has been shown that the choice
of smoothing greatly affects the retrieval effectiveness of the multinomial language
model [Zhai and Lafferty 2004].
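
As a minimal sketch of Eq. 3, the following Python computes the smoothed term probability for a fixed value of the smoothing parameter. The function name, the toy two-document collection, and the value 0.5 are assumptions made purely for illustration.

```python
from collections import Counter

def smoothed_prob(term, doc_counts, doc_len, coll_counts, coll_len, pi):
    """Eq. 3: (1 - pi) * p(t|theta_d) + pi * p(t|theta_c), both by maximum likelihood."""
    p_doc = doc_counts.get(term, 0) / doc_len      # c(t, d) / |d|
    p_coll = coll_counts.get(term, 0) / coll_len   # cf_t / |c|
    return (1.0 - pi) * p_doc + pi * p_coll

# Toy collection of two documents (illustrative data only).
docs = [["polya", "urn", "urn"], ["language", "model", "urn"]]
coll_counts = Counter(t for d in docs for t in d)
coll_len = sum(len(d) for d in docs)
doc_counts = Counter(docs[0])

# "model" does not occur in the first document but still receives non-zero mass.
print(smoothed_prob("model", doc_counts, len(docs[0]), coll_counts, coll_len, pi=0.5))
```

Because both component distributions sum to one over the vocabulary, the interpolated distribution does as well, for any value of the smoothing parameter between 0 and 1.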


3.2. Smoothing 

One of the simplest forms of smoothing uses linear-interpolation, also called
Jelinek-Mercer smoothing, where $\pi$ is assigned a value between 0 and 1. In this linear
smoothing approach, the parameter $\pi$ is usually set by experimentally tuning on
training data. Typically there has been no guidance on the setting of this parameter
as the effectiveness of this smoothing approach is quite sensitive to specific parameter
values. However, a more effective smoothing method for the multinomial language
model uses Bayesian smoothing in the form of a Dirichlet prior on the background
multinomial. For this approach $\pi$ is defined as follows:

    $\pi = \frac{\mu}{\mu + |d|}$    (4)


where $\mu$ is the concentration parameter and is the sum of the individual |v|-Dirichlet
parameters. This concentration parameter is also assigned a value based on
experimentation, though it has been found that it achieves a relatively stable performance
when $\mu = 2000$ [Zhai and Lafferty 2004]. The Dirichlet prior parameter $\mu$ can be
interpreted as the number of pseudo-counts of the background multinomial prior to the
document data. Intuitively, this type of smoothing gives a greater credence to
probability estimates that are derived from longer documents, compared to those derived
from shorter documents, as the longer documents are likely to be more accurate
representations of the document model. The prior parameters (pseudo-counts) of the
|v|-component Dirichlet distribution are $\alpha_t = \mu \cdot p(t|\theta_c)$ for all $t \in v$ and are updated
using the document observations to $\alpha_t = \mu \cdot p(t|\theta_c) + c(t,d)$ for all $t \in v$. Therefore, the
concentration parameter of the Dirichlet distribution changes from $\mu$ to $\mu + |d|$ once
the distribution has been updated. Throughout this article we will continue the
convention of specifying a |v|-component Dirichlet using the parameters of a multinomial
distribution (with |v|-1 degrees of freedom) multiplied by a concentration parameter
(i.e. $\mu$).
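
The pseudo-count interpretation above can be checked numerically. The following Python sketch, in which the function names and example values are illustrative assumptions, compares the normalised updated Dirichlet parameters with the interpolation of Eq. 3 when the smoothing parameter is set by Eq. 4.

```python
import math

def dirichlet_pi(mu, doc_len):
    """Eq. 4: pi = mu / (mu + |d|)."""
    return mu / (mu + doc_len)

def pseudo_count_estimate(term_count, doc_len, p_coll, mu):
    """Normalised updated parameter: (mu * p(t|theta_c) + c(t, d)) / (mu + |d|)."""
    return (mu * p_coll + term_count) / (mu + doc_len)

def interpolated_estimate(term_count, doc_len, p_coll, mu):
    """Eq. 3 with pi taken from Eq. 4."""
    pi = dirichlet_pi(mu, doc_len)
    return (1.0 - pi) * (term_count / doc_len) + pi * p_coll

# Example values (assumed): a term seen 3 times in a 150-token document.
a = pseudo_count_estimate(3, 150, 0.001, 2000.0)
b = interpolated_estimate(3, 150, 0.001, 2000.0)
print(a, b, math.isclose(a, b))

# Longer documents receive less smoothing.
print(dirichlet_pi(2000.0, 100), dirichlet_pi(2000.0, 10000))
```

The two estimates agree because (1 - pi) * c(t, d) / |d| simplifies to c(t, d) / (mu + |d|).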

It has been shown that the query likelihood model with Dirichlet prior smoothing
and the model with Jelinek-Mercer smoothing can be implemented as efficiently as
traditional retrieval functions, which only use weights from terms that are common to
both document and query.^3


4. A SMOOTHED POLYA URN DOCUMENT MODEL 

In this section we first introduce the generalised Polya urn model and outline some of 
its important characteristics. We then show how this can be used to model document 
generation before specifying the query likelihood approach for the new model. Finally, 
we outline how the parameters of the SPUD model are estimated and smoothed. 


4.1. A Polya Urn Process 

Consider a process that starts with an urn containing m balls in total, where each ball
is one of |v| distinct colours. Starting at time i = 0, a ball is sampled with replacement
from the urn, and a ball of the same colour is replicated and added to the urn. This
process continues until |d| balls have been sampled from the urn. The total number of
balls in the urn at the end of the process is m + |d|. This is a typical description of the
multivariate Polya urn model which uses sampling with reinforcement. We use this
process as a conceptual model for document generation, where the different colours
represent distinct terms, where the initial counts of the |v| different coloured balls in
the urn represent the document model, and where the |d| observations drawn represent
the actual document.
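
The urn process described above can be simulated directly. The following Python is a minimal sketch; the function name, the three-term vocabulary, and the seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

def polya_urn_sample(initial_counts, num_draws, rng):
    """Draw num_draws balls with reinforcement: each drawn ball is returned
    to the urn together with one extra ball of the same colour."""
    urn = dict(initial_counts)
    draws = []
    for _ in range(num_draws):
        colours = list(urn)
        weights = [urn[c] for c in colours]
        colour = rng.choices(colours, weights=weights, k=1)[0]
        urn[colour] += 1  # reinforcement step
        draws.append(colour)
    return draws, urn

rng = random.Random(0)
initial = {"pavement": 1, "sidewalk": 1, "footpath": 1}  # m = 3 balls
draws, final_urn = polya_urn_sample(initial, 20, rng)     # |d| = 20 draws
print(Counter(draws))
print(sum(final_urn.values()))  # m + |d|
```

With a small initial mass, whichever term happens to be drawn early tends to dominate the remaining draws, which is the bursty behaviour the model is intended to capture.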

This multivariate Polya urn model has recently been described in an alternative
manner as consisting of a multinomial and the Chinese restaurant process [Sunehag
2007; Goldwater et al. 2011]. Again, consider an urn that contains m balls of |v|
different colours, but now also consider a bag d that is initially empty. For all times starting
at time i = 0, a ball is chosen from the urn with probability m/(m + i) and from the bag
with probability i/(m + i), and each time it is replaced from where it was drawn. For
each draw, a ball of the same colour that was drawn is generated and placed in the bag.
In this alternative description, the number of balls m in the urn remains static, while
the number of balls in the bag d is i at any particular time. The non-reinforced urn can
be modelled as a multinomial and the bag can be modelled as the Chinese restaurant
process.

^3 See the original source Zhai and Lafferty 2004 for the derivations of these efficient retrieval functions.
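
That the two descriptions assign the same probability to the next draw can be checked with a short Python sketch. The function names and example counts are illustrative assumptions.

```python
import math

def urn_prob(colour, initial_counts, bag_counts):
    """Single reinforced urn: (initial count + reinforced count) / (m + i)."""
    m = sum(initial_counts.values())
    i = sum(bag_counts.values())
    return (initial_counts[colour] + bag_counts.get(colour, 0)) / (m + i)

def two_stage_prob(colour, initial_counts, bag_counts):
    """Static urn chosen with probability m/(m+i), bag with probability i/(m+i)."""
    m = sum(initial_counts.values())
    i = sum(bag_counts.values())
    from_urn = (m / (m + i)) * (initial_counts[colour] / m)
    from_bag = (i / (m + i)) * (bag_counts.get(colour, 0) / i) if i > 0 else 0.0
    return from_urn + from_bag

# Example state (assumed): 4 balls in the static urn, 4 draws already in the bag.
urn = {"red": 3.0, "blue": 1.0}
bag = {"red": 1, "blue": 3}
for colour in urn:
    print(colour, urn_prob(colour, urn, bag), two_stage_prob(colour, urn, bag))
```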

This two-stage generative process has been outlined recently by Goldwater et al.
[2011] and Sunehag [2007], and while the entire process is identical to the multivariate
Polya urn model described previously, it may be more intuitive in terms of a generative
story of document creation. This is because the document is modelled as a separate
entity that starts empty, and ends after |d| terms have been drawn. We re-introduce the
alternative description here only to motivate the application of this process to that of
document generation. This is very much in the spirit of that proposed by Simon [1955]
where an author generates a document by drawing words from some distribution and
also by drawing words from those previously used in the document in order to create
association. For the remainder of the article, when we refer to an urn, we mean a Polya
urn by default, unless otherwise stated.

It is well-known that the distribution of colours in the multivariate Polya process
follows the DCM (multivariate Polya distribution). It is also known that the Polya urn is
an example of a bounded martingale process [Pemantle 2007], where the proportions of
colours in the urn converge to a Dirichlet distribution. During the process, the
drawing, subsequent replication, and addition of an observation (which must be identically
distributed to the initial distribution) only serves to reinforce the initial distribution.
Therefore, all subsequent balls drawn from the Polya urn are identically distributed,
but are not independent. Furthermore, the process is exchangeable, meaning that the
ordering of the outcomes can be swapped to result in the same probability
distribution. Therefore, the document model remains a bag-of-words because the ordering of
the terms in the document is not modelled.
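
Exchangeability, and the fact that draws are identically distributed but not independent, can be verified on a small example. The following Python is a sketch in which the function name and the initial counts are assumptions made for illustration.

```python
import math
from itertools import permutations

def sequence_prob(sequence, initial_counts):
    """Probability of an ordered sequence of draws from a Polya urn."""
    urn = dict(initial_counts)
    total = sum(urn.values())
    prob = 1.0
    for colour in sequence:
        prob *= urn[colour] / total
        urn[colour] += 1   # reinforce the drawn colour
        total += 1
    return prob

initial = {"a": 2.0, "b": 1.0}

# Every ordering of the same multiset of outcomes has the same probability.
probs = {seq: sequence_prob(seq, initial) for seq in set(permutations("aab"))}
print(probs)

# Identically distributed: the second draw has the same marginal as the first.
first_a = sequence_prob("a", initial)
second_a = sequence_prob("aa", initial) + sequence_prob("ba", initial)
# Not independent: seeing "a" first raises the chance of "a" second.
second_a_given_first_a = sequence_prob("aa", initial) / first_a
print(first_a, second_a, second_a_given_first_a)
```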


4.2. Document Generation as a Polya process 

We use the Polya urn, and therefore the DCM, as a model for document generation
where the author generates an actual document d by drawing |d| terms from the
reinforced document model. Intuitively, different documents are written in different styles
(some styles exhibiting more word burstiness than others), and therefore, the degree
of reinforcement will be document specific. Consequently, we assume that each
document is drawn from a different document DCM, and therefore we need to estimate the
parameters of a different document DCM for each document d.

The probability density function for the DCM is as follows:

    $p(d|\alpha) = \int_{\theta} p(d|\theta) \, p(\theta|\alpha) \, d\theta$    (5)

where $\alpha$ is the initial |v|-dimensional parameter vector of a Dirichlet distribution.
Conceptually, one can think of drawing a multinomial $\theta$ from a Dirichlet distribution
specified by $\alpha$, and subsequently drawing a sample d from the multinomial. The parameters
of the DCM can be interpreted as the initial number of instances of each coloured ball
in the Polya urn. Therefore, the sum of the DCM parameter vector can be
interpreted as the initial number of balls in the urn (i.e. $m_d = \sum_{t \in v} \alpha_{dt}$) and is the
concentration parameter. This is the factor that controls burstiness on a document level,
and when $m_d$ is large the model exhibits low burstiness as adding balls to the urn
changes the state of the urn very little. In fact, when $m_d \to \infty$, the DCM tends to
the multinomial distribution (i.e. no burstiness) [Elkan 2006]. Conversely, if there are
very few balls in the urn initially (i.e. $m_d \to 0$), the model exhibits high burstiness as
the first ball drawn alters the initial state of the urn by reinforcement quite
substantially. Therefore, the problem lies in estimating the initial parameters of the document
DCM $\alpha_d$ given that the document d was generated by this reinforced random process.
For consistency, the notation we use to specify the |v|-dimensional parameter vector
of the DCM is similar to that of the Dirichlet distribution (i.e. using a multinomial
distribution and a concentration parameter).
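
The effect of the concentration parameter on burstiness can be made concrete with a small Python sketch. It computes the probability that a term is drawn again immediately after its first occurrence, for an urn whose initial count for that term is the concentration multiplied by the term's expected probability. The function name and the example values are illustrative assumptions.

```python
import math

def repeat_prob(p_t, m_d):
    """P(second draw is t | first draw was t) for a Polya urn with initial
    count alpha_t = m_d * p_t and total initial mass m_d."""
    return (m_d * p_t + 1.0) / (m_d + 1.0)

p_t = 0.01  # expected probability of a rare term (assumed value)
for m_d in (1e-9, 1.0, 100.0, 1e9):
    print(m_d, repeat_prob(p_t, m_d))
```

For a very large initial mass the repeat probability approaches p_t, which is the multinomial (no burstiness) limit; for a very small initial mass it approaches 1, so a term that has occurred once is almost certain to recur.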

Furthermore, given that documents only contain a subset of the terms in the
collection, we do not wish to assign zero probabilities to terms that do not occur in a
document. Therefore, we smooth each document DCM $\alpha_d$ with a background DCM
model $\alpha_c$. The background model is the single model most likely to have generated all
documents given our reinforced process, and therefore, we estimate the parameters
of a background DCM $\alpha_c$, given all of the n documents. There are different ways in
which we can smooth these two DCM models and we will outline these in Section 4.6.
In general, we are not restricted to smoothing only two DCM models to construct our
document model, and any number of plausible DCM models could be combined to help
explain observations in the document. However, in this article we confine ourselves to
smoothing only two DCM models for each document d.


4.3. Non-Reinforced Query Likelihood 

Once the parameters of the document model ($M_d = \alpha_{dm}$) have been estimated, we
need to rank these document models with respect to a query. In the multinomial
language model, both the document and query are assumed, for the purposes of ranking,
to have been generated from a multinomial. This simplifies the estimation of the
document model and the estimation of the query likelihood given the document model.

As mentioned earlier, we assume that documents and queries are generated
differently. More specifically we assume that queries do not exhibit word burstiness. This
in fact simplifies the query likelihood given our new document model. We assume that
the documents are generated from a DCM document model $\alpha_{dm}$, and that the query
is generated from the document model (urn) using sampling with replacement (no
reinforcement). Modelling query generation in this manner means that each term in
the query is treated independently. Consequently, documents are ranked according to
the following query likelihood formula:

    $\log p(q|M_d = E[\theta_{dm}|\alpha_{dm}]) = \log \prod_{t \in q} p(t|M_d)^{c(t,q)} = \sum_{t \in q} (\log p(t|M_d) \cdot c(t,q))$    (6)

where $E[\theta_{dm}|\alpha_{dm}]$ is the expected multinomial of the DCM document model for
document d.
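
A minimal Python sketch of Eq. 6 follows. It uses the standard fact that the expected multinomial of a Dirichlet is its parameter vector divided by the sum of the parameters. The function names are illustrative, and the parameter values below are placeholders rather than estimates, since the smoothing that produces the document model is specified later in the article.

```python
import math
from collections import Counter

def expected_multinomial(alpha):
    """Mean of a Dirichlet with parameters alpha: alpha_t / sum of alpha."""
    total = sum(alpha.values())
    return {t: a / total for t, a in alpha.items()}

def query_log_likelihood(query_terms, alpha_dm):
    """Eq. 6: sum over query terms of c(t, q) * log p(t | M_d)."""
    p = expected_multinomial(alpha_dm)
    return sum(qtf * math.log(p[t]) for t, qtf in Counter(query_terms).items())

# Placeholder parameters for a three-term vocabulary (assumed values).
alpha_dm = {"urn": 4.0, "model": 3.0, "query": 1.0}
print(query_log_likelihood(["urn", "urn", "query"], alpha_dm))
```

Because the query is scored against the expected multinomial only, repeated query terms contribute linearly, which reflects the assumption that queries are not bursty.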

4.4. Estimation of the Document DCM 

We now estimate the parameters of the document DCM $\alpha_d$ using the observations from
the actual document d. Given only one sample (i.e. the document) it is not possible to
fully specify the maximum likelihood estimates of the document DCM.^4 The maximum
likelihood estimates of the multinomial inferred from one document will be equal to the
expected value of the estimated DCM. Therefore, the maximum likelihood estimates
of the multinomial from which the terms in the document were drawn (i.e. $c(t,d)/|d|$)
will be proportional to the maximum likelihood estimates of the document DCM (i.e.
$\theta_d \propto \alpha_d$). This is only true in the case where there is one sample.

^4 The minimum number of samples needed to estimate both the expected value (a multinomial) and the
concentration parameter (burstiness) is two.

Fig. 1. Documents generated from multinomials drawn from a Dirichlet distribution for both document
(left) and background language models (right)

Fig. 1 (left) shows this graphically for a simplified two-dimensional model that uses
white and black balls to represent terms. The x-axis represents multinomials of
varying parameter values. Points on the left-hand side of the x-axis represent multinomials
where the probability of drawing a white ball is high, while points on the right-hand
side of the x-axis represent multinomials where the probability of drawing a black ball
is high. The Dirichlet distribution represents the likelihood of drawing these
multinomials. In Fig. 1 (left), the expectations of both of the two-dimensional Dirichlet
distributions, shown by the red and blue curves, are equal and represent the multinomial
(red arrow) inferred from the document.

Therefore, when we have only one multinomial (inferred from a document), we can 
only specify the location (expected multinomial) and not the shape (concentration pa¬ 
rameter) of the DCM. In order to completely define the parameters of the document 
DCM, we also have to define the concentration rud = ^dt, which can be inter¬ 
preted as the level of belief associated with the maximum likelihood estimates of the 
expected multinomial. Therefore, the initial parameters of the |r |-component document 
DCM are estimated as follows: 


&d = md ■ Od = {md ■p{ti\d),md ■ p{t 2 \d), ....,md ■ p{t\^i\d)) (7) 

where p{t\d) = c{t,d)/\d\ for aWt ^ v and where is the initial mass that controls 
the burstiness of the document model. Although estimation of the paramete rs of the 
DCM using multiple data vectors is computationally expensive | Minkaj|2000| , we can 
see that estimating the parameters of each document DCM is trivial if a suitable value 
for rud can be found. 

Given that $m_d$ is the level of belief associated with the expected document
multinomial $\alpha_d$, it would seem intuitive to aim to minimise this belief in the absence of
evidence (an Occam's razor type argument). A minimum setting can be arrived at by
determining the minimum initial number of balls in the urn that could have generated
the document. Given a document d, the minimum number of balls initially in the urn
is the number of distinct coloured balls drawn. Therefore, we estimate the concentration
parameter of the document DCM as $m_d = |\vec{d}|$. This is the maximum amount of
burstiness that is supported using this argument. In Fig. 1 (left) our estimate of $m_d$ for
the document model is $m_d = 2$, which leads to the shape of the Dirichlet in blue. Setting
$m_d$ according to this parsimonious principle ensures that we have not over-fitted
to our data.
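The estimate in Eq. (7) with $m_d = |\vec{d}|$ can be illustrated with a minimal sketch. The function name `document_dcm` and the four-ball toy document are illustrative assumptions, not part of the original work.

```python
from collections import Counter


def document_dcm(tokens):
    """Return (m_d, alpha_d) for a single document, following Eq. (7).

    m_d is set to the number of distinct terms (the document vector length),
    and each alpha_{d,t} is m_d * c(t, d) / |d|.
    """
    counts = Counter(tokens)
    doc_len = sum(counts.values())      # |d|: number of word tokens
    m_d = len(counts)                   # number of distinct terms
    alpha_d = {t: m_d * c / doc_len for t, c in counts.items()}
    return m_d, alpha_d


# Toy two-colour document in the spirit of the urn example (hypothetical data).
m_d, alpha_d = document_dcm(["white", "white", "white", "black"])
print(m_d, alpha_d)
```

The parameters sum to $m_d$, so dividing each by $m_d$ recovers the maximum likelihood multinomial $c(t,d)/|d|$.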


4.5. Estimation of the Background DCM 

For the DCM document models, there exist dependencies between successive occurrences
of the same term in a document, and therefore, the estimation of the background
DCM is more complex than for the multinomial distribution. In fact, in the entire
collection, the only occurrences of the same term that are independent of each other are
those in different documents. This leads to the introduction of a document boundary
into the background DCM of the new language model, something that is lacking in the
multinomial language model.

The estimation of a background DCM using all n document vectors is, as mentioned
previously, computationally expensive. However, Elkan [2006] has shown that, for textual
data, very close approximations to the maximum likelihood estimates of the DCM
(via the EDCM) are proportional to $\sum_{j=1}^{n} I(c(t,d_j) > 0)$ for all $t \in v$, where $I$ is the
indicator function. These approximations are accurate for textual data because most terms
do not occur in all n documents, and furthermore, it has been shown that the
approximations make little difference to the effectiveness of the model for text-related tasks.
It can be seen that this approximation relates to the number of documents in which a
term occurs (i.e. the document frequency). Using an appropriate normalisation factor
we obtain a probability estimate as follows:

$$p(t|\hat{\theta}'_c) = \frac{\sum_{j=1}^{n} I(c(t,d_j) > 0)}{\sum_{t' \in v} df_{t'}} = \frac{df_t}{\sum_{t' \in v} df_{t'}} = \frac{df_t}{\sum_{j=1}^{n} |\vec{d}_j|} \qquad (8)$$

where n is the number of documents in the collection and the numerator is the document
frequency of a term. The normalisation factor can be re-written and comprises
the summation of all document vectors in the collection so that $\sum_{t \in v} p(t|\hat{\theta}'_c) = 1$.
This probability distribution can be viewed as the expected multinomial drawn from
the background EDCM. The estimates of the background DCM, which are approximately
proportional to these probability estimates, are defined in a similar manner to
the document DCM by introducing one concentration parameter $m_c$. This results in
the following parameter estimates for the background DCM:

$$\alpha_c = \big( m_c \cdot p(t_1|\hat{\theta}'_c),\; m_c \cdot p(t_2|\hat{\theta}'_c),\; \ldots,\; m_c \cdot p(t_{|v|}|\hat{\theta}'_c) \big) \qquad (9)$$

where $m_c$ is the belief in the expected value of the Dirichlet (i.e. $p(t|\hat{\theta}'_c)$) and can be
interpreted as a type of document word burstiness throughout the collection.

Fig. 1 (right) shows a graphical example of a two-dimensional background Dirichlet.
As before, the x-axis determines parameter values of the multinomials, and the black
curve shows the likelihood of drawing these multinomials. The four document samples
shown in the figure exhibit high levels of burstiness as they contain a disproportionate
number of balls of one specific colour. This is because areas of higher likelihood in
Fig. 1 (right) lead to multinomials with one component that contains most of the probability
mass. The convex shape of this curve is due to a low $m_c$ concentration parameter, and
therefore models high levels of word burstiness. Although the expectation of this
over-dispersed two-dimensional Dirichlet has a low likelihood, it is nonetheless expected in
the statistical sense. Essentially, the use of a DCM explains greater term-frequency
variation in the n documents in the collection.

Footnote: This introduces an idf-like measure into this language model and is discussed later
in Section 5.2.4.

4.6. Smoothing and Retrieval Models

We now present two smoothing methods which can be used to linearly combine K 
multiple DCM models. 



Fig. 2. Document generation in the SPUD model for both types of smoothing 


4.6.1. Linear Smoothing of Expected Multinomials. Conceptually, both the background and
document DCM can be thought of as a Polya urn. The first approach to smoothing
treats each of these models as distinct Polya urns. A document is generated by drawing
balls, with reinforcement, from the K urns according to a certain probability. Essentially
this smoothing approach linearly combines the expected values (multinomials)
of the Dirichlets. This general smoothing approach is as follows:

$$p(d|\alpha_{dm}) = \sum_{i=1}^{K} \lambda_i \cdot p(d|\alpha_i) \qquad (10)$$

where $\sum_{i=1}^{K} \lambda_i = 1$ and $\alpha_i$ is the DCM model. In this work we only linearly combine
two models, the document DCM and the background DCM, and therefore the SPUD
retrieval model using this smoothing approach is defined as follows:

$$\hat{\theta}_{dm} = (1 - \lambda_{jm}) \cdot \hat{\theta}_d + \lambda_{jm} \cdot \hat{\theta}'_c \qquad (11)$$

where $\lambda_{jm}$ is the smoothing parameter and can be interpreted as the probability of
selecting a term from the background DCM. We note that this formulation is identical
to that of Hiemstra [1998]. Fig. 2 (left) shows the graphical model for the DCM
language model with this type of linear smoothing (Jelinek-Mercer).

One of the main motivations for smoothing the document model with a background
model is that the background model assigns mass to terms unseen in the document.
Therefore, $\lambda_{jm}$ can be interpreted as the probability of drawing a previously unseen
term from the background model, and $1 - \lambda_{jm}$ as the probability of drawing a previously
seen term (i.e. a repeated term from the document). During the generation of the
document d, at least $|\vec{d}|$ previously unseen terms were drawn. This leads to an estimate of
$\hat{\lambda}_{jm} = |\vec{d}|/|d|$ as the probability of drawing an unseen term for that document model.
This is the proportion of distinct terms in the document and is the estimate of drawing
from the background multinomial $\hat{\theta}'_c$. The SPUD retrieval model with this type of
smoothing is denoted SPUD$_{jm}$ and it has no free parameters. We note that the estimation
of $\lambda_{jm}$ is not a consequence of the DCM model, and can therefore be applied to
the multinomial language model that uses Jelinek-Mercer smoothing.


4.6.2. Linear Smoothing of DCMs. The second approach to smoothing uses a linear mixture
of the DCM models. Conceptually, this approach to smoothing combines the contents
of the K urns into one single Polya urn. A document is then generated by drawing
with reinforcement from this single urn. This smoothing approach is a more complete
Bayesian approach to smoothing and the parameters of the document model are as
follows:

$$\alpha_{dm} = \sum_{i=1}^{K} \omega_i \cdot \alpha_i \qquad (12)$$

where $\sum_{i=1}^{K} \omega_i = 1$ and $\alpha_i$ is the DCM language model. The $\omega$ parameters are linear
mixing parameters that determine the relative weight of the DCM language models.
It is worth noting that each of the DCMs has a concentration parameter which acts
to weight the vector appropriately. Given the document DCM and background DCM
estimated previously, the smoothing is as follows:

$$\alpha_{dm} = (1 - \omega) \cdot m_d \cdot \hat{\theta}_d + \omega \cdot m_c \cdot \hat{\theta}'_c \qquad (13)$$

where $\omega$ is the linear mixing parameter. Fig. 2 (right) shows the graphical model
for this DCM mixture model. The expected multinomial drawn from this DCM mixture
model is easily computed using the individual parameters of the DCM mixture
model over the normalisation constant. This DCM mixture retrieval model is denoted
SPUD$_{dir}$ due to the mixing of the Dirichlets. Although it seems that the DCM mixture
model still has two unknown parameters (i.e. $m_c$ and $\omega$), these can either be combined
to form one single parameter or $m_c$ can be estimated using numerical methods as outlined
in the original work introducing the EDCM [Elkan 2006]. We outline the details
of these approaches in the next section.


5. RETRIEVAL MODEL IMPLEMENTATION 

In this section we outline the composition of the SPUD retrieval methods using both 
types of smoothing presented in the preceding section. We then present some retrieval 
intuitions that aid in understanding the retrieval aspects of the new model. 


Footnote: This is analogous to the tuning parameter $\mu$ in the multinomial language model using
Dirichlet prior smoothing.

5.1. Retrieval Functions


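Similarly to the implementation of the standard multinomial models [Zhai and Lafferty
2004], our approach can be computed efficiently using a summation that only
involves terms common to both document and query. The SPUD$_{jm}$ retrieval function
using linear smoothing is as follows:

$$SPUD_{jm}(q,d) = \sum_{t \in q} \left( \log\left( (1 - \lambda_{jm}) \cdot \frac{c(t,d)}{|d|} + \lambda_{jm} \cdot \frac{df_t}{\sum_{t' \in v} df_{t'}} \right) \cdot c(t,q) \right) \qquad (14)$$

where $\lambda_{jm} = |\vec{d}|/|d|$. This is rank equivalent to the following:

$$SPUD_{jm}(q,d) = |q| \cdot \log(\lambda_{jm}) + \sum_{t \in q \cap d} \left( \log\left( 1 + \frac{(1 - \lambda_{jm}) \cdot c(t,d) \cdot \sum_{j=1}^{n} |\vec{d}_j|}{\lambda_{jm} \cdot |d| \cdot df_t} \right) \cdot c(t,q) \right) \qquad (15)$$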
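A minimal sketch of the two SPUD$_{jm}$ forms may help make the rank equivalence concrete. Both functions below follow the description above (full query-term sum versus the sum over matching terms only); the function names, the toy documents, and the document-frequency statistics are hypothetical. The two scores differ only by $\sum_{t \in q} c(t,q) \cdot \log(df_t / \sum_{t'} df_{t'})$, which does not depend on the document.

```python
import math
from collections import Counter


def spud_jm_full(query, doc, df, sum_df):
    """SPUD_jm as a sum over all query terms (the form of Eq. 14)."""
    q, d = Counter(query), Counter(doc)
    doc_len = sum(d.values())            # |d|
    lam = len(d) / doc_len               # lambda_jm = distinct terms / tokens
    score = 0.0
    for t, qtf in q.items():
        p_doc = d[t] / doc_len
        p_bg = df[t] / sum_df            # df-based background probability
        score += qtf * math.log((1.0 - lam) * p_doc + lam * p_bg)
    return score


def spud_jm_matching(query, doc, df, sum_df):
    """Rank-equivalent form that only visits terms shared by query and document."""
    q, d = Counter(query), Counter(doc)
    doc_len = sum(d.values())
    lam = len(d) / doc_len
    score = sum(q.values()) * math.log(lam)
    for t, qtf in q.items():
        if d[t] > 0:
            ratio = (1.0 - lam) * d[t] * sum_df / (lam * doc_len * df[t])
            score += qtf * math.log(1.0 + ratio)
    return score


# Hypothetical statistics: every query term is assumed to occur in the collection.
df = {"polya": 2, "urn": 5}
sum_df = 40
query = ["polya", "urn"]
doc1 = ["polya", "polya", "urn", "the", "of"]
doc2 = ["model", "model", "model", "a", "b", "c"]
constant = sum(math.log(df[t] / sum_df) for t in query)
```

The document-independent `constant` is what is dropped when moving from the first form to the second.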

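The SPUD$_{dir}$ retrieval function can be computed in a somewhat similar form to the
multinomial language model using Dirichlet prior smoothing as follows:

$$SPUD_{dir}(q,d) = \sum_{t \in q} \left( \log\left( \frac{(1 - \omega) \cdot |\vec{d}| \cdot \frac{c(t,d)}{|d|} + \omega \cdot m_c \cdot \frac{df_t}{\sum_{j=1}^{n} |\vec{d}_j|}}{(1 - \omega) \cdot |\vec{d}| + \omega \cdot m_c} \right) \cdot c(t,q) \right) \qquad (16)$$

which is rank equivalent to the following:

$$SPUD_{dir}(q,d) = |q| \cdot \log\left( \frac{\mu'}{|\vec{d}| + \mu'} \right) + \sum_{t \in q \cap d} \left( \log\left( 1 + \frac{|\vec{d}| \cdot c(t,d) \cdot \sum_{j=1}^{n} |\vec{d}_j|}{\mu' \cdot |d| \cdot df_t} \right) \cdot c(t,q) \right) \qquad (17)$$

where $\mu'$ is a combination of $\omega$ and $m_c$ as follows:

$$\mu' = \frac{\omega \cdot m_c}{1 - \omega} \qquad (18)$$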
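The relationship between the mixture form, the rank-equivalent form, and the combined parameter $\mu' = \omega \cdot m_c / (1 - \omega)$ can be checked with a small sketch. The function names and all statistics below are hypothetical, and query terms are assumed to occur somewhere in the collection.

```python
import math
from collections import Counter


def spud_dir_mixture(query, doc, df, sum_df, omega, m_c):
    """SPUD_dir written directly from the DCM mixture (the form of Eq. 16)."""
    q, d = Counter(query), Counter(doc)
    doc_len, vec_len = sum(d.values()), len(d)   # |d| and distinct-term count
    denom = (1.0 - omega) * vec_len + omega * m_c
    score = 0.0
    for t, qtf in q.items():
        num = (1.0 - omega) * vec_len * d[t] / doc_len + omega * m_c * df[t] / sum_df
        score += qtf * math.log(num / denom)
    return score


def spud_dir_matching(query, doc, df, sum_df, mu_prime):
    """Rank-equivalent form over matching terms only (the form of Eq. 17)."""
    q, d = Counter(query), Counter(doc)
    doc_len, vec_len = sum(d.values()), len(d)
    score = sum(q.values()) * math.log(mu_prime / (vec_len + mu_prime))
    for t, qtf in q.items():
        if d[t] > 0:
            ratio = vec_len * d[t] * sum_df / (mu_prime * doc_len * df[t])
            score += qtf * math.log(1.0 + ratio)
    return score


# Hypothetical settings; mu_prime follows Eq. (18).
omega, m_c = 0.8, 50.0
mu_prime = omega * m_c / (1.0 - omega)
df = {"polya": 2, "urn": 5}
sum_df = 40
query = ["polya", "urn"]
doc1 = ["polya", "polya", "urn", "the", "of"]
doc2 = ["model", "model", "model", "a", "b", "c"]
constant = sum(math.log(df[t] / sum_df) for t in query)
```

As with the Jelinek-Mercer variant, the two forms differ only by a document-independent constant, so they induce the same ranking.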


As $\mu'$ is the only parameter that has not been estimated so far, we now outline two
approaches to finding suitable values for it. The first approach is to experimentally tune
$\mu'$ on training data in a similar manner to the Dirichlet prior smoothing parameter $\mu$
in the multinomial language model [Zhai and Lafferty 2001a]. Alternatively, $m_c$ can
be estimated from the n samples of observations using Newton's method [Elkan 2006]
as follows:

$$m_c^{new} = \frac{\sum_{t \in v} df_t}{\sum_{d} \psi(|d| + m_c) - n \cdot \psi(m_c)} \qquad (19)$$

where $\psi(x) = \frac{d}{dx} \log \Gamma(x)$ is the digamma function and $\Gamma$ is the gamma function. When
estimating $m_c$ from the data using this method, $\omega$ is the parameter that requires
experimental tuning. However, we expect that one setting for the hyperparameter
$\omega$ will perform robustly across many test collections. Experiments for both of these
approaches to determining suitable values for the free parameters are outlined in
Section 6.4.
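The update for $m_c$ can be sketched as a simple repeated substitution. This is only an illustration under stated assumptions: the update is taken to be the sum of document frequencies divided by $\sum_d \psi(|d| + m_c) - n \cdot \psi(m_c)$ with $|d|$ the document length in tokens (as in the EDCM update of Elkan [2006]), the iteration is a plain fixed-point loop rather than a tuned numerical routine, the digamma function is a hand-written series approximation, and the toy collection statistics are hypothetical.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # shift x upwards using psi(x) = psi(x+1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def update_m_c(m_c, doc_lengths, sum_df):
    """One application of the update: new m_c from the current m_c."""
    n = len(doc_lengths)
    denom = sum(digamma(length + m_c) for length in doc_lengths) - n * digamma(m_c)
    return sum_df / denom


def estimate_m_c(doc_lengths, sum_df, m_c=1.0, tol=1e-10, max_iter=10000):
    """Repeat the update until m_c stops changing (or max_iter is reached)."""
    for _ in range(max_iter):
        new_m_c = update_m_c(m_c, doc_lengths, sum_df)
        if abs(new_m_c - m_c) < tol:
            return new_m_c
        m_c = new_m_c
    return m_c


# Hypothetical toy collection: four documents with 25 tokens in total and
# 9 (document, distinct term) pairs, i.e. the sum of document frequencies is 9.
doc_lengths = [10, 6, 4, 5]
sum_df = 9
m_c = estimate_m_c(doc_lengths, sum_df)
print(m_c)
```

A low value of `m_c` relative to the document lengths corresponds to a bursty collection, in line with the interpretation of the concentration parameter given earlier.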


5.2. Length Normalisation and Document Boundary Retrieval Intuitions 

We now examine some retrieval intuitions and existing hypotheses that help explain
the differences between the SPUD retrieval functions and the multinomial retrieval
functions. For most of the analysis in this section we focus on the best performing
multinomial model (MQL$_{dir}$) and its counterpart from the SPUD model (SPUD$_{dir}$).
Robertson and Walker [1994] outlined two hypotheses concerning the length of a
document, namely the verbosity hypothesis and the scope hypothesis, which we now examine.


5.2.1. Verbosity Hypothesis. The verbosity hypothesis captures the intuition that some
documents are longer than others simply because they are more verbose. Such documents
do not describe more topics, they are simply more wordy. This hypothesis captures
an aspect of document length that is independent of relevance. However, the
initial description of this hypothesis [Robertson and Walker 1994] does not outline any
formal means of determining whether a particular retrieval function is consistent with
the hypothesis. We now outline a retrieval constraint which helps to determine this.

LNC2*. If d and d' are two documents, where d' is constructed by concatenating
d with itself k times where k > 0, and if s(q, d) is the score returned from a retrieval
function s which is used to rank d with respect to q, then s(q, d) = s(q, d').

This states that if a document is concatenated with itself any number of times, the
retrieval score of that document should not change for a given query, and therefore it
should not change rank. We call this constraint LNC2* as this is stricter than LNC2
outlined by Fang et al. [2004], which only states that $s(q,d) \le s(q,d')$. Essentially
if a scoring function s adheres to LNC2*, then we deem s to be consistent with the
verbosity hypothesis.

Consider a relevant document d that is ranked in a certain position according to
s(q, d). If d is replaced in the collection with d', d' should not be ranked lower than the
initial document d. Therefore, s(q, d') should certainly not be less than s(q, d) simply
due to the verbosity of d'. Now consider a non-relevant document d of a given length. If
d is replaced in the collection with d', d' should not be ranked in a higher position than
d originally was. Therefore, given that we do not know the relevance of d a priori, we
argue that in general s(q, d') should not increase simply due to the increased verbosity
of d'.

The maximum likelihood estimate of a term in a document (i.e. $c(t,d)/|d|$) will not
change if that document is concatenated with itself any number of times. However, in
the multinomial language model using Dirichlet prior smoothing (MQL$_{dir}$), LNC2* is
only satisfied when $c(t,d)/|d| = cf_t/|c|$, which is not often the case. For this model, if
there are many query-term matches in d, the more verbose document d' will nearly
always be ranked higher than d (i.e. $s(q,d') > s(q,d)$), while if there are very few
query-term matches in d the verbose document d' will nearly always be ranked lower
than d (i.e. $s(q,d') < s(q,d)$). However, if we examine the SPUD$_{dir}$ method in Eq. (17),
we can see that the document vector length $|\vec{d}|$ is used as one form of document length
normalisation. The document vector length $|\vec{d}|$ will remain unchanged for the
concatenated document d', and therefore SPUD$_{dir}(q,d)$ = SPUD$_{dir}(q,d')$.
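The LNC2* behaviour described above can be checked numerically with a minimal sketch. It scores a document and a threefold concatenation of it while holding the collection statistics fixed. The SPUD$_{dir}$ function uses the rank-equivalent form of Eq. (17); the MQL$_{dir}$ function is the usual Dirichlet prior query likelihood; the function names and all collection statistics are hypothetical.

```python
import math
from collections import Counter


def spud_dir(query, doc, df, sum_df, mu_prime):
    """SPUD_dir score in the rank-equivalent form of Eq. (17)."""
    q, d = Counter(query), Counter(doc)
    doc_len, vec_len = sum(d.values()), len(d)
    score = sum(q.values()) * math.log(mu_prime / (vec_len + mu_prime))
    for t, qtf in q.items():
        if d[t] > 0:
            ratio = vec_len * d[t] * sum_df / (mu_prime * doc_len * df[t])
            score += qtf * math.log(1.0 + ratio)
    return score


def mql_dir(query, doc, cf, coll_len, mu):
    """Multinomial query likelihood with Dirichlet prior smoothing."""
    q, d = Counter(query), Counter(doc)
    doc_len = sum(d.values())
    return sum(
        qtf * math.log((d[t] + mu * cf[t] / coll_len) / (doc_len + mu))
        for t, qtf in q.items()
    )


# Hypothetical collection statistics, held fixed when the document is lengthened.
df, sum_df = {"urn": 40}, 5000          # document frequencies
cf, coll_len = {"urn": 60}, 20000       # collection frequencies
query = ["urn"]
doc = ["polya", "urn", "urn", "model", "the"]
doc_k = doc * 3                         # the document concatenated with itself

spud_before = spud_dir(query, doc, df, sum_df, 200.0)
spud_after = spud_dir(query, doc_k, df, sum_df, 200.0)
mql_before = mql_dir(query, doc, cf, coll_len, 200.0)
mql_after = mql_dir(query, doc_k, cf, coll_len, 200.0)
print(spud_before, spud_after, mql_before, mql_after)
```

In this toy case the query term is relatively frequent in the document, so the MQL$_{dir}$ score rises for the concatenated document while the SPUD$_{dir}$ score is unchanged.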

In general the multinomial model not only over-promotes recurrences of query terms
but over-penalises recurrences of non-query terms in a given document. Fig. 3 (left)
shows the increase in weight as the term-frequency increases for both MQL$_{dir}$ and
SPUD$_{dir}$. We can see that MQL$_{dir}$ gives a greater weight to terms with higher
frequencies than SPUD$_{dir}$. This is because the aspect of document length that is affected
as term-frequency increases is different for both retrieval functions. It is important
that term-frequency is analysed considering the change in document length that an
increase in term-frequency brings about. Fig. 3 (right) also shows the penalisation
due to recurrences of non-query terms for both MQL$_{dir}$ and SPUD$_{dir}$. We can see that
MQL$_{dir}$ penalises recurrences of non-query terms more than SPUD$_{dir}$. In the SPUD$_{dir}$
function, recurrences of the same non-query term will always decrease the score of a
document due to more off-topic verbosity.

Footnote: Just prior to publication we found that a similar constraint has previously been
outlined in [Na et al. 2008].

Footnote: We note that we are ignoring the effect that creating a longer document d' would have
on the background collection model. For an extremely large collection this effect would be
negligible. Furthermore, we note that the multinomial language model that uses Jelinek-Mercer
smoothing adheres to LNC2*, while the SPUD$_{jm}$ does not. However, there are other reasons for
the generally weaker performance of the standard multinomial language model with
Jelinek-Mercer smoothing.




Fig. 3. Change in weight as term-frequency increases for MQL$_{dir}$ and SPUD$_{dir}$ in a document that initially
contains 100 distinct terms (left). Change in weight as recurring non-query terms are added to a document
that initially contains 100 distinct terms (right). (x-axis: document length)


Interestingly, it can be seen that the SPUD$_{dir}$ formula in Eq. (17) contains the ratio
between the term-frequency $c(t,d)$ and the average term-frequency in the document
$|d|/|\vec{d}|$. This average term-frequency normalisation idea was first proposed by Singhal
et al. [1996], but in general was not shown to improve retrieval effectiveness substantially
until recent research [Paik 2013]. It is this part of the SPUD$_{dir}$ retrieval model
that deals specifically with the verbosity hypothesis, while the document length
normalization component, the left-hand side of Eq. (17), deals with the scope hypothesis
by replacing the original document length with the document vector length. We now
discuss this further.


5.2.2. Scope Hypothesis. The scope hypothesis captures the alternative intuition that
documents may be longer because they cover many different topics. It has been noted
in the original work regarding the scope hypothesis [Robertson and Walker 1994] that
many Newswire documents in the original TREC corpora seemed as if they consisted
of multiple different news articles concatenated together. In the multinomial language
model there is no difference in the normalisation applied when a term occurs for the
first time (i.e. an increase in scope) as opposed to when a term repeats itself (i.e. an
increase in verbosity). This difference is modelled in the new SPUD language models
and can be viewed as being modelled separately for SPUD$_{dir}$. In Eq. (17), we can see
that the factor $|q| \cdot \log(\mu'/(\mu' + |\vec{d}|))$ leads to a penalisation only for the occurrence of
distinct terms (i.e. when the scope broadens). If the term re-occurs, it is not penalised
by the part of the retrieval function which deals with scope.

For the SPUD$_{dir}$ model, adding a non-query term into a document for the first time
will lead to penalisation by the normalisation aspect that deals with scope. However, it
should be noted that the verbosity aspect of a document is also affected by the addition
of previously unseen non-query terms and this actually promotes existing query-terms.
Therefore, the overall document score does not necessarily decrease when a new
non-query term is added.

Footnote: We assume that the number of distinct terms in a document is a crude measure of scope.

In the SPUD$_{dir}$ model, the magnitude of the document score penalisation for the first
occurrence of a non-query term is quite similar to the penalisation applied by MQL$_{dir}$
(see Fig. 3) but recurrences are not penalised as much. Given these observations, we
hypothesise that the SPUD$_{dir}$ retrieval method does not penalise long documents as
much as MQL$_{dir}$. Recent research has studied the over-penalisation of long documents
by many retrieval functions including MQL$_{dir}$ [Lv and Zhai 2011]. They built upon
work by Singhal et al. [1996] which showed that most ranking functions retrieve long
documents with a likelihood less than their likelihood of relevance. We replicate that
analysis by binning relevant documents according to length and then estimating the
probability that a document occurs in a given bin (length) given that it is relevant.
The same procedure is applied to retrieved documents where a document is deemed
retrieved if it occurs in the top 1000 documents of the ranked list. We use the same
binning strategy as Lv and Zhai [2011] (i.e. 5000) and compared the MQL$_{dir}$ and SPUD$_{dir}$
retrieval functions. The aspect of length used in this analysis is that of the number of
word tokens in the document (i.e. $|d|$).




Fig. 4. Probability of retrieval/relevance for MQL$_{dir}$ and SPUD$_{dir}$ methods for trec-9/10 collection for short
queries (left) and medium length queries (right). (x-axis: document length)


Fig. 4 shows the probability of relevance (in black) and the probability of retrieval
for both MQL$_{dir}$ (in blue) and SPUD$_{dir}$ (in red) on one collection. Firstly, we note that
the trends are consistent with the previous approaches [Singhal et al. 1996; Lv and
Zhai 2011]. Furthermore, we can see that longer documents have a higher likelihood
of being retrieved by the SPUD$_{dir}$ approach compared to the MQL$_{dir}$ approach. This
confirms our intuitions that the SPUD$_{dir}$ model does not penalise long documents as
much as MQL$_{dir}$ and that we would expect the SPUD method to retrieve long documents
with a probability closer to their likelihood of relevance. We investigate this
further in the experimental section (Section 6.5).


5.2.3. Background model. The new background model in the SPUD brings about some
other interesting retrieval characteristics. Given the sample collection in Table II of
four documents and two terms ($t_1$ and $t_2$), we might wish to determine the most likely
one-term string, $q = \{t_1\}$ or $q = \{t_2\}$, generated from the background model. If we
assume a multinomial background model estimated using maximum likelihood, then
$p(t_1|\hat{\theta}_c) = 8/15$ and $p(t_2|\hat{\theta}_c) = 7/15$, suggesting that $q = \{t_1\}$ is the more likely.
However, intuitively we see that the high frequency of $t_1$ in document $d_1$ is unduly
biasing the estimates, especially as term $t_1$ only appears in one document. Term $t_2$
occurs in all of the documents, and therefore, is a word used more widely in the
collection (possibly by more authors in general). The SPUD model takes the document
boundary into account yielding estimates of $p(t_1|\hat{\theta}'_c) = 1/5$ and $p(t_2|\hat{\theta}'_c) = 4/5$
respectively. This probability is similar to that proposed in one of the first language modelling
approaches [Hiemstra 1998], and has recently been re-examined as being potentially
theoretically valid [Roelleke 2013].

Table II. Sample collection of four documents and two terms

docs    t_1    t_2
d_1      8      2
d_2      0      1
d_3      0      3
d_4      0      1
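The two background estimates quoted above can be reproduced with a short sketch. The term counts are those of the sample collection in Table II (as implied by the quoted probabilities); the variable names are illustrative.

```python
from fractions import Fraction

# Term counts per document for the four-document, two-term sample collection.
collection = {
    "d1": {"t1": 8, "t2": 2},
    "d2": {"t1": 0, "t2": 1},
    "d3": {"t1": 0, "t2": 3},
    "d4": {"t1": 0, "t2": 1},
}
terms = ["t1", "t2"]

# Multinomial background model: collection frequency over total tokens.
total_tokens = sum(sum(doc.values()) for doc in collection.values())
multinomial = {
    t: Fraction(sum(doc[t] for doc in collection.values()), total_tokens)
    for t in terms
}

# Document-frequency-based background model: df_t over the sum of all df values.
df = {t: sum(1 for doc in collection.values() if doc[t] > 0) for t in terms}
sum_df = sum(df.values())
dcm_background = {t: Fraction(df[t], sum_df) for t in terms}

print(multinomial)
print(dcm_background)
```

The single bursty document dominates the multinomial estimate for the first term, whereas the document-frequency-based estimate counts it only once.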



Fig. 5. idf and global weightings derived from the SPUD model (x-axis: document frequency)


As seen in this toy example, the proposed model uses the document frequency in its
approximation for the parameters of the background DCM. Furthermore, the
normalisation component used in the background model can be written as
$\sum_{j=1}^{n} |\vec{d}_j| = n \cdot |\vec{d}|_{avg}$, where $|\vec{d}|_{avg}$ is the average document vector length.
Therefore, in the SPUD$_{dir}$ retrieval formula, the weight assigned to a query-term that
occurs in a document comprises the following factor as per the right-hand side of Eq. (17):

$$\log\left( 1 + \delta \cdot \frac{n}{df_t} \right) \cdot c(t,q) \qquad (20)$$

where $\delta = |\vec{d}| \cdot |\vec{d}|_{avg} \cdot c(t,d)/(\mu' \cdot |d|)$. We can see that this factor can be viewed as a
new family of idf. Unlike the traditional idf measure, this factor is document-length,
document-vector-length, and term-frequency specific. We have found that $\delta$ typically
ranges from 0.05 to 0.5 for query terms on many of the collections used in this work.
Fig. 5 shows the weight assigned by idf and by Eq. (20) as the document frequency
changes. This suggests that the global weighting factor in our new approach is closely
related to idf. This crude comparison by no means validates the traditional idf from a
theoretical perspective; nevertheless, it does present a theoretical means by which aspects
of both term-frequency and document frequency combine in one model. In contrast, the
multinomial language modelling approach treats terms that are completely independent
of each other, written by different authors on different topics, similarly to terms
that are highly dependent on each other (e.g. terms that are repeated, possibly due to
association, in a document written by one author on a particular topic). Discovering
a theoretical justification for the combination of both term-frequency and idf is
problematic as they appear to lie in different event spaces. The preferential attachment
captured in the SPUD model is a promising generative theory justifying tf-idf type
schemes.

Footnote: We note that [Sunehag 2007] has previously derived the traditional idf from a Polya
process using slightly different assumptions.
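A small sketch can make the comparison between idf and the idf-like factor concrete. It evaluates $\log(n/df_t)$ alongside $\log(1 + \delta \cdot n/df_t)$ (the per-occurrence factor with $c(t,q) = 1$) at the two ends of the reported range of $\delta$. The collection size `n` and the function names are hypothetical choices for illustration.

```python
import math


def idf(df_t, n):
    """Traditional inverse document frequency, log(n / df_t)."""
    return math.log(n / df_t)


def spud_global_weight(df_t, n, delta):
    """idf-like factor log(1 + delta * n / df_t) for a single query occurrence."""
    return math.log(1.0 + delta * n / df_t)


n = 10000  # hypothetical number of documents in the collection
rows = []
for df_t in (10, 100, 1000, 10000):
    rows.append((
        df_t,
        idf(df_t, n),
        spud_global_weight(df_t, n, 0.05),
        spud_global_weight(df_t, n, 0.5),
    ))
for row in rows:
    print(row)
```

Both weights fall as the document frequency grows, but the SPUD factor stays positive even for a term that occurs in every document, where the traditional idf reaches zero.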

A practical consideration is whether there is substantial difference between the
probability of a term given either background model (multinomial or DCM) when
estimated from data. Therefore, we estimated the background probability of seeing a term
for both models for all query terms on one of the test collections used in our
experiments. We analysed 1530 query-terms from the trec-9/10 test collection and we found
a high linear correlation (0.954) between the estimated probabilities for the terms. This
is to be expected as the probabilities are fundamentally capturing similar information
about a term. However, there are examples where the estimated probabilities of actual
query-terms are quite different. Table III shows the top and bottom 10 terms when
ranked according to the ratio of their probabilities (i.e. $p(t|\alpha_c)/p(t|\hat{\theta}_c)$). The bottom 10
terms show those that the background multinomial gives a much higher probability
to when compared to the background DCM. It is interesting that the term el, which
has much higher probability in the multinomial model, is a stopword from a different
language. This receives a relatively high probability estimate from a multinomial
because it appears many times, but receives a much lower probability estimate from the
background DCM because many of these appearances come from few documents (i.e.
the term is quite bursty). The background DCM regards these terms as less general
than the multinomial, as the occurrences have actually occurred in fewer documents.
Conversely, the top 10 terms show those that the multinomial model has estimated
as less general but which the background DCM has estimated as being more general.
These terms are less bursty but have occurred in many documents in the collection.

Therefore, given that there exist query-terms where the probability of occurrence
under our new model is quite different, we would expect this to impact retrieval
effectiveness. We evaluate the effect that the new document normalisation and background
model have on retrieval effectiveness separately in Section 6.5.

6. EXPERIMENTS 

In this section we outline the experiments used to evaluate the new SPUD methods.
We first outline the experimental design and methodology, before presenting the
experiments.

6.1. Experimental Design 

We carry out four experiments to evaluate different aspects of the new SPUD query
likelihood models. The first experiment evaluates the retrieval effectiveness of the new
SPUD retrieval methods against a number of baselines. The second experiment evaluates
the robustness of the tuning parameters in the SPUD retrieval methods. The third
experiment presents an analysis of the retrieval intuitions outlined in the preceding
section. Finally, we evaluate the best SPUD retrieval method when incorporating it
into a pseudo-relevance feedback framework.


Footnote: See [Robertson 2004] for a thorough review of theoretical attempts to justify idf
with term-frequency.

Table III. Ratio of estimated query-term probabilities of DCM to multinomial model
($p(t|\alpha_c)/p(t|\hat{\theta}_c)$) for a number of query-terms on trec-9/10

rank   Bottom 10 term   ratio           Top 10 term    ratio
1      vike             0.3461802368    funnel-shap    2.0327764077
2      el               0.3910938927    undergon       1.9616724275
3      cancer           0.4098663257    pejor          1.9517391904
4      patient          0.415028517     tartin         1.9495633384
5      cell             0.4180174157    superstiti     1.9149743114
6      student          0.4289654263    gynt           1.9071815267
7      drug             0.4726560064    interest       1.8928123508
8      system           0.4950172629    unsuccess      1.8803346212
9      law              0.5064728539    work           1.8709780476
10     infect           0.51433562      run-awai       1.8666031964


6.2. Datasets 

Table IV shows the characteristics of the TREC test collections used in the experiments.
We use a wide variety of TREC collections that are of varying sizes and include
collections of Web documents, Newswire articles, and medical abstracts. In our
experiments we evaluate short keyword queries (2-3 terms) consisting of the title field of
the TREC topic, medium queries (6-10 terms) consisting of both the title and description
fields of the topics, and long verbose queries (10-30 terms) consisting of the title,
description, and narrative fields of the topic. We remove standard stopwords and apply
stemming using Porter's stemmer. It is worth noting that the ohsumed test collection
contains only description length queries (i.e. medium length queries), while there are
only title length queries available for the mq-07 and mq-08 test collections.


Table IV. Test Collection Details

                                                                        average query length
label       collection        # docs       # topics  topic range        short    medium        long
                                                                        (title)  (title+desc)  (title+desc+narr)
ohsu        ohsumed           293,856      63        001-63             n/a      5.0           n/a
robust-04   fr, ft, la, fbis  528,155      250       301-450, 601-700   2.5      10.3          31.4
trec-8      wt2g              221,066      50        401-450            2.4      9.0           27.5
trec-9/10   wt10g             1,692,096    100       451-550            2.6      9.3           24.3
gov2        gov2              25,205,179   150       701-850            2.8      8.6           33.3
mq-07       gov2              25,205,179   1778      1-10k              3.1      n/a           n/a
mq-08       gov2              25,205,179   784       10k-20k            3.7      n/a           n/a


6.3. Retrieval Effectiveness 

The first experiment evaluates the retrieval effectiveness of the SPUD model against
its counterpart, the standard multinomial query likelihood language model. We compare
the SPUD$_{jm}$ retrieval function against the multinomial query likelihood function
with Jelinek-Mercer smoothing (MQL$_{jm}$). We tune the MQL$_{jm}$ function for each set of
queries to optimise mean average precision (MAP) on each test collection where the
parameter space is $\lambda_{jm} \in \{0.1, 0.2, \ldots, 0.9, 1.0\}$. Therefore, we are confident that the
effectiveness of the MQL$_{jm}$ retrieval function is close to optimal on each collection. On
the other hand, we do not tune SPUD$_{jm}$ as it has no free parameters.

We compare the SPUDdir retrieval function against its counterpart, the multinomial
query likelihood function with Dirichlet-prior smoothing (MQLdir). Similarly, we tune
the MQLdir function to optimise MAP on each test collection, where the parameter
space is μ ∈ {250, 500, ..., 2250, 2500}. We report the effectiveness of the SPUDdir
retrieval function for the same parameter setting as MQLdir (i.e. μ' = μ). This evaluation
favours MQLdir, as SPUDdir may not be tuned optimally.

Footnote: http://trec.nist.gov/

We also use the DCM-L-T retrieval function [Xu and Akella 2008], which has a tuning
parameter γ. We tuned γ for each set of queries on each collection over the parameter
space γ ∈ {0.1, 0.2, ..., 0.9, 1.0}.
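For reference, the two multinomial baselines can be sketched in Python as follows. This
is a minimal sketch of the standard textbook forms of query likelihood with Jelinek-Mercer
and Dirichlet-prior smoothing, not the implementation used in these experiments. We assume
that documents and queries are lists of tokens, that `p_bg` is a hypothetical dictionary
giving a non-zero background probability for every query term, and that the Jelinek-Mercer
parameter weights the background model.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def mql_jm(query: Sequence[str], doc: Sequence[str],
           p_bg: Dict[str, float], lam: float = 0.2) -> float:
    """log p(q|d) under Jelinek-Mercer smoothing (lam weights the background)."""
    tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p = (1.0 - lam) * tf[term] / doc_len + lam * p_bg[term]
        score += math.log(p)
    return score


def mql_dir(query: Sequence[str], doc: Sequence[str],
            p_bg: Dict[str, float], mu: float = 2000.0) -> float:
    """log p(q|d) under Dirichlet-prior smoothing with pseudo-count mass mu."""
    tf = Counter(doc)
    doc_len = len(doc)
    return sum(math.log((tf[term] + mu * p_bg[term]) / (doc_len + mu))
               for term in query)
```

Both functions iterate over query tokens, so a repeated query term contributes once per
occurrence.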


6.3.1. Retrieval Effectiveness Results. Tables V and VI show the retrieval effectiveness
(MAP and NDCG@20) of MQLjm compared to SPUDjm, and MQLdir compared to
SPUDdir for short title queries (2-3 terms on average). We can see that on most of
the test collections the SPUD retrieval methods demonstrate a significant increase in
effectiveness for both MAP and NDCG@20 over their corresponding MQL methods.
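Since NDCG@20 is reported alongside MAP, a small Python sketch of the metric may be
useful. It assumes graded relevance gains used linearly with a log2 rank discount, which
is one common formulation; the exact gain function used by the evaluation tool in these
experiments is not stated here, so this choice is an assumption.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int = 20) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: Sequence[float],
              all_gains: Sequence[float], k: int = 20) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

Here `ranked_gains` holds the relevance grades of the returned documents in rank order,
and `all_gains` holds the grades of all judged documents for the query.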


Table V. MAP of SPUD models vs MQL models (A means two-sided t-test
p < 0.01, A means p < 0.05) and SPUD models vs DCM-L-T (• means
two-sided t-test p < 0.01 compared to DCM-L-T, o means p < 0.05 compared to
DCM-L-T).

short queries
           robust-04  trec-8  trec-9/10  gov2     mq-07    mq-08
DCM-L-T    0.248      0.306   0.187      0.288    0.409    0.413
MQLjm      0.231      0.246   0.135      0.245    0.396    0.419
SPUDjm     0.236      0.255   0.154A     0.276a   0.411a   0.430a
MQLdir     0.247      0.308   0.192      0.303    0.420    0.427
SPUDdir    0.252a     0.319A  0.200A     0.314a•  0.431a•  0.445Ao


Table VI. NDCG@20 of SPUD models vs MQL models (A means two-sided
t-test p < 0.01, A means p < 0.05) and SPUD models vs DCM-L-T (• means
two-sided t-test p < 0.01 compared to DCM-L-T, o means p < 0.05 compared
to DCM-L-T).

short queries
           robust-04  trec-8  trec-9/10  gov2     mq-07    mq-08
DCM-L-T    0.423      0.449   0.298      0.455    0.465    0.495
MQLjm      0.385      0.356   0.220      0.379    0.458    0.495
SPUDjm     0.398A     0.384A  0.243A     0.418a   0.474A   0.503
MQLdir     0.423      0.466   0.309      0.470    0.488    0.500
SPUDdir    0.432A     0.477   0.322      0.492a•  0.500Ao  0.513Ao


Tables VII and VIII show the retrieval effectiveness (MAP and NDCG@20) of MQLjm
compared to SPUDjm, and MQLdir compared to SPUDdir for medium length queries
(6-10 terms on average). Again we can see that on most of the test collections the SPUD
models demonstrate an increase in effectiveness for both MAP and NDCG@20. All of
these increases are significant in the case of SPUDdir. For long queries (10-30 terms
on average) we see a similar trend. A point worth emphasising is that the increases
in effectiveness are also present at the top of the ranked lists, as demonstrated by
NDCG@20.

The SPUDdir approach outperforms the previous DCM relevance-based model
(DCM-L-T) on most test collections. We have found that DCM-L-T performs similarly
to MQLdir for short queries on some of the smaller collections, but that it performs
quite poorly on the larger gov2, mq-07, and mq-08 test collections and for all medium
and long queries. Statistical significance tests (using a two-sided t-test, indicated by
o and •) show that the best performing SPUD model (SPUDdir) outperforms the DCM-L-T
approach on some collections for short queries and consistently outperforms a tuned
DCM-L-T approach for longer queries. We discuss some possible reasons for these
results in Section 7.1.

Footnote: The original paper does not outline a recommended parameter space. However,
when tuning from 0.1 to 1.0, a maximum stationary point for effectiveness was found for
each set of queries.
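The significance markers in the tables come from two-sided t-tests. The Python sketch
below shows how such a test statistic can be computed, assuming that the test is paired
over per-query scores (the usual practice when two systems are run on the same topics,
although the pairing is our assumption). The function name is hypothetical, and only the
statistic and degrees of freedom are returned; the p-value would be obtained from the
Student t distribution.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(scores_a: Sequence[float],
                       scores_b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for per-query score differences a - b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

For a two-sided test, the absolute value of t is compared against the critical value for
the chosen significance level (e.g. 0.01 or 0.05) at the returned degrees of freedom.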

Table VII. MAP of SPUD models vs MQL models (A means two-sided
t-test p < 0.01, A means p < 0.05) and SPUD models vs
DCM-L-T (• means two-sided t-test p < 0.01 compared to DCM-L-T,
o means p < 0.05 compared to DCM-L-T).

medium length queries
           robust-04  trec-8  trec-9/10  gov2     ohsu
DCM-L-T    0.266      0.296   0.181      0.256    0.255
MQLjm      0.277      0.283   0.191      0.276    0.239
SPUDjm     0.280      0.291   0.203Ao    0.299a•  0.248A
MQLdir     0.281      0.325   0.238      0.315    0.253
SPUDdir    0.289A•    0.347a• 0.247a•    0.329a•  0.270a•


Table VIII. NDCG@20 of SPUD models vs MQL models (A means
two-sided t-test p < 0.01, A means p < 0.05) and SPUD models vs
DCM-L-T (• means two-sided t-test p < 0.01 compared to DCM-L-T,
o means p < 0.05 compared to DCM-L-T).

medium length queries
           robust-04  trec-8  trec-9/10  gov2     ohsu
DCM-L-T    0.435      0.436   0.318      0.401    0.396
MQLjm      0.455      0.412   0.329      0.431    0.397
SPUDjm     0.456      0.440A  0.344A•    0.463A•  0.391
MQLdir     0.465      0.478   0.393      0.484    0.399
SPUDdir    0.479A•    0.500A• 0.403A•    0.502a•  0.415Ao


Table IX. MAP of SPUD models vs MQL models (A means
two-sided t-test p < 0.01, A means p < 0.05) and SPUD
models vs DCM-L-T (• means two-sided t-test p < 0.01
compared to DCM-L-T, o means p < 0.05 compared to
DCM-L-T).

long queries
           robust-04  trec-8  trec-9/10  gov2
DCM-L-T    0.239      0.225   0.181      0.235
MQLjm      0.284      0.269   0.211      0.265
SPUDjm     0.288A•    0.269•  0.206o     0.285A•
MQLdir     0.283      0.283   0.248      0.296
SPUDdir    0.296A•    0.314a• 0.254•     0.323A•


Fig. 6 shows the performance of MQLdir vs SPUDdir for each query on two separate
test collections. This query-specific analysis indicates that SPUDdir is robust on all
ranges of queries (from easy to difficult). For longer queries on the robust-04 dataset,
there are one or two high performing queries which drop over 0.1 in average
precision. However, in general there are very few queries which severely underperform
compared to MQLdir. On the trec-9/10 web documents, the increase in performance is
stable across all types of queries for all query lengths.
Table X. NDCG@20 of SPUD models vs MQL models (A
means two-sided t-test p < 0.01, A means p < 0.05) and
SPUD models vs DCM-L-T (• means two-sided t-test p <
0.01 compared to DCM-L-T, o means p < 0.05 compared
to DCM-L-T).

long queries
           robust-04  trec-8  trec-9/10  gov2
DCM-L-T    0.404      0.346   0.318      0.400
MQLjm      0.469      0.430   0.354      0.535
SPUDjm     0.476•     0.431•  0.356•     0.541•
MQLdir     0.467      0.446   0.406      0.572
SPUDdir    0.483a•    0.475A• 0.409•     0.599A•


Fig. 6. Average precision of all short, medium, and long queries for MQLdir vs SPUDdir on
the robust-04 dataset (top) and trec-9/10 (bottom). (Scatter plots; horizontal axis: AP for
MQL.)


6.4. Robustness 

The second experiment evaluates the robustness of the SPUD models with respect to
different parameter settings. In addition, we evaluate the retrieval effectiveness of the
SPUDdir model when the parameter μ' is derived from the estimated parameter m_c
using Newton's method [Elkan 2006].


6.4.1. Robustness Results. Fig. 7 shows the performance of MQLjm over different
tuning parameter values (i.e. λjm) and the performance of the SPUDjm model. We can see
that SPUDjm, which has no free parameters, outperforms MQLjm over all parameter
values. This trend is consistent on all test collections used here.

For the SPUDdir function, we can estimate m_c using Newton's method as outlined
in Eq. (19) given the n documents as data. We found that an initial value of m_c = 200
was suitable so that the process converged within 20 iterations. This computation can
be done off-line and we used the resulting setting of m_c to estimate μ' by tuning the
hyperparameter ω to a fixed value. We set ω = 0.8, which is demonstrated in Fig. 8 as
a reasonable setting. Fig. 8 shows the performance (MAP) of the SPUDdir model for
different values of ω when m_c is estimated using Newton's method. The relationship
between ω and μ' defined earlier essentially suggests that μ' = 4·m_c is a suitable
parameter value for μ'. Although m_c is the only parameter that is expensive to estimate
in the SPUDdir model, it is practically feasible to do so offline. When the parameter μ' is
computed in this way (i.e. μ' = 4·m_c), we denote this SPUDest in the experimental
results and figures that follow.

Fig. 7. Robustness comparison of MQLjm and SPUDjm on robust-04 (left) and trec-9/10 (right)
for short queries
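The arithmetic behind this setting can be made explicit with a short Python sketch. We
assume here that ω is the relative weight of the background mass, i.e. ω = μ' / (μ' + m_c);
this is our reading of the stated four-to-one weighting and is not copied from the equation
itself. The function name is hypothetical.

```python
def mu_prime_from_mc(m_c: float, omega: float = 0.8) -> float:
    """Solve omega = mu' / (mu' + m_c) for mu' (assumed relationship)."""
    if not 0.0 <= omega < 1.0:
        raise ValueError("omega must lie in [0, 1)")
    return omega / (1.0 - omega) * m_c


# With omega = 0.8 the multiplier is 0.8 / 0.2 = 4, so an estimated
# background mass of m_c = 234 gives mu' of approximately 936.
mu_prime = mu_prime_from_mc(234.0)
```

Under this assumption, changing ω rescales μ' for every collection at once, while m_c
carries the collection-specific information.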




Fig. 8. Tuning of ω in SPUDdir on robust-04 (short), trec-8 (short), and ohsu (medium)
collections when m_c is estimated using Eq. (19)


Fig. 9 shows the performance of MQLdir, SPUDdir, and SPUDest over different
parameter settings on a number of test collections. We can see that SPUDdir outperforms
MQLdir over all parameter values. We can also see that the parameter μ' is as robust as
the parameter μ, as it tends to follow the same trend.

More importantly, we see that near optimal effectiveness can usually be achieved
by using the automatically estimated value of m_c found using Newton's method. This
is rather encouraging as it means that the setting of ω = 0.8 is robust and that we
can effectively and safely eliminate the free parameters from SPUDdir. In particular,
this automatic estimation can be seen when we examine in Fig. 9 the trec-9/10
collection (which contains long Web documents) and the robust-04 collection (which has
shorter documents). For the robust-04 collection, the retrieval effectiveness decreases
sharply when μ' becomes greater than 1000. On the other hand, for the trec-9/10
collection the effectiveness is more stable when μ' is greater than 1000. One probable
reason for this is that the average length of the documents in those collections is very
different. However, the automatically estimated SPUDest is close to optimal on both
collections.
Fig. 9. Robustness of SPUDdir and MQLdir over different values of the tuning parameter μ (or
μ') on robust-04 (short) and trec-9/10 (short) respectively


Table XI reinforces this observation. It shows the average length of documents in the
collections and the value of m_c that is estimated on each collection. We can see that
m_c is correlated with the lengths of documents in the collections. Furthermore, in the
same table we can see that close to optimal effectiveness is possible by setting
ω = 0.8 for SPUDest. This is because m_c is essentially performing the tuning on a
per-collection basis. The parameter m_c has a very intuitive interpretation as the
initial mass of the background Polya urn.


Table XI. MAP comparison of SPUDdir model for well-tuned μ' and
SPUDest which uses an automatically estimated value of μ'

                 robust-04  trec-8  trec-9/10  ohsu   gov2   mq-07  mq-08
|d⃗|avg           162        242     157        68     181    181    181
|d|avg           265        558     344        104    529    529    529
m_c              258        421     326        112    234    234    234
μ' = 4·m_c       1034       1688    1308       448    936    936    936

short queries
SPUDdir          0.252      0.319   0.200      n/a    0.314  0.431  0.445
SPUDest          0.249      0.320   0.199      n/a    0.314  0.429  0.443

medium queries
SPUDdir          0.289      0.347   0.247      0.270  0.332  n/a    n/a
SPUDest          0.287      0.344   0.246      0.270  0.329  n/a    n/a

long queries
SPUDdir          0.296      0.314   0.254      n/a    0.323  n/a    n/a
SPUDest          0.295      0.307   0.255      n/a    0.322  n/a    n/a


6.5. Analysis of Retrieval Model Aspects 

In this third experiment we aim to evaluate the retrieval effectiveness of the new
background model (i.e. α_c) and the new smoothing methods in the SPUD model separately
in a piece-wise fashion. We gradually adapt parts of the multinomial query likelihood
functions until we arrive at the SPUD retrieval functions. The experiment pinpoints
the parts of the SPUD retrieval functions that lead to changes in retrieval
effectiveness. This piece-wise adaptation provides evidence that the individual retrieval
intuitions outlined in Section 5.2 are valid. Furthermore, we conduct an analysis of the
retrieval characteristics of the best performing methods.
6.5.1. Results of the Analysis of Retrieval Model Aspects. Table XII, which also contains a
column for a hybrid model, outlines the parameter values for the functions used in
this experiment. Essentially, this hybrid retrieval function differs from the SPUD
retrieval functions only in the fact that it uses different parameter estimates for λjm and
m_d, which affect the smoothing for SPUDjm and SPUDdir respectively. The changes to
these parameter estimates make the hybrid model closer to the multinomial retrieval
functions. The only difference between the MQL and hybrid models is that the hybrid uses
the expected multinomial of the background DCM as its background model.


Table XII. Decomposition of Retrieval Functions

                               MQL          hybrid                SPUD
Smoothing            Colour    Multinomial  DCM                   DCM
Jelinek-Mercer (jm)  Blue      λjm = 0.2    λjm = 0.2             λjm = |d⃗|/|d|
Dirichlet (dir)      Red       μ = 2000     μ' = 2000, m_d = |d|  μ' = 2000, m_d = |d⃗|


Fig. 10. Analysis of performance gains from different parts of the SPUD retrieval models for
trec-8 (top) and trec-9/10 (bottom) for short, medium, and long queries. Models that use
Jelinek-Mercer smoothing are on the left-hand side, while those that use Dirichlet smoothing
are on the right-hand side. (Horizontal axis: Query Length, with values short, medium, and
long.)

Fig. 10 shows the effectiveness of the functions that use Jelinek-Mercer smoothing
(left-hand side) and those that use a type of Dirichlet smoothing (right-hand side) on
two test collections. In general, the use of the new background DCM model aids
retrieval, as we can see an increase in effectiveness for the hybrid (in black) retrieval
functions over the MQL functions (in blue). We note that the magnitude of the
difference is small, and that in some cases the performance decreases slightly. In general,
the introduction of a document boundary into the estimation of the background
language model is more effective for SPUDdir than for SPUDjm.
However, the different smoothing techniques introduced in the SPUD model yield
a greater increase in performance, i.e. when comparing the SPUD function (in red)
to the hybrid function (in black). The smoothing in the SPUD model, amongst other
factors, affects document length normalisation and improves retrieval effectiveness
substantially. The results in Fig. 10 demonstrate* that the new retrieval
characteristics brought about by both the background DCM and the document DCM positively
influence retrieval effectiveness. The results of these experiments further validate the
use of the DCM as a more plausible document model than the multinomial. This is
because the changes to the query-likelihood retrieval method that the new background
and new document model bring about increase the retrieval effectiveness of the SPUDdir
method over MQL.

Previously in Section 5.2.2 we analysed the lengths of documents retrieved in the
top 1000 documents by both MQLdir and SPUDdir. We found that SPUDdir was more
likely to retrieve longer documents. We now look at the length characteristics of the
top 20 documents returned per query by both SPUDdir and MQLdir to determine if the
differences in length are correlated with increased performance in terms of NDCG@20.
Firstly, Fig. 11 confirms that on average SPUDdir retrieves documents with a longer
vector length than MQLdir in the top 20. Table XIII shows the correlation between the
differences in average length and the differences in NDCG in the top 20 documents
across a number of representative test collections. We report a small but insignificant
correlation between the increase in average vector length and query effectiveness (as
measured by NDCG@20). Although this correlation analysis is somewhat inconclusive,
we can confirm that on average SPUDdir retrieves documents with a longer vector
length (i.e. a greater number of distinct terms), and the overall evidence seems to
suggest that this leads to increased effectiveness.




Fig. 11. Difference in average vector length of the top 20 returned documents for SPUDdir and
MQLdir on trec-8 (left) and gov2 (right) web collections for short queries.


Table XIII. Linear correlation of Δ average document
length in top 20 and Δ NDCG@20 over short query
sets for Web collections

avg_len    trec-8   trec-9/10  gov2
|d|        0.0525   0.0462     -0.0161
|d|        0.0838   0.0622     0.0216


* Results on other test collections used in this work are consistent with those reported in Fig. 10.
6.6. Pseudo-Relevance Feedback

Finally, we evaluate the SPUD model in a pseudo-relevance feedback setting.
Pseudo-relevance feedback is a useful approach for expanding short queries when the user
has not entered a sufficiently long query. In essence, the pseudo-relevance model is
responsible for the selection and weighting of candidate expansion query-terms from
the top k documents of an initial retrieval run. We adapt the state-of-the-art RM3
[Abdul-Jaleel et al. 2004; Diaz and Metzler 2006] approach to select and weight terms
according to the SPUDdir retrieval approach. The pseudo-relevance model based on
SPUDdir is estimated from an initial ranking as follows:


    p(t|q_e) = \sum_{\alpha_{dm} \in R_\alpha} p(t|\alpha_{dm}) \cdot \frac{p(q|M_d = E[\theta_{dm}|\alpha_{dm}])}{\sum_{\alpha'_{dm} \in R_\alpha} p(q|M_d = E[\theta_{dm}|\alpha'_{dm}])}    (21)

where R_α is the set of pseudo-relevant document models (i.e. it is the top k
document models from an initial retrieval run). If we replace p(t|α_dm) with p(t|θ_dm) and
p(q|M_d = E[θ_dm|α_dm]) with p(q|θ_dm) in Eq. (21), we recover RM3. The final query
model is then estimated by linearly smoothing this estimated relevance model p(t|q_e)
with the original query as follows:


    p(t|q') = \tau \cdot p(t|q) + (1 - \tau) \cdot p(t|q_e)    (22)

where τ controls the weight of the initial query. The new query model is then used to
query the corpus using the initial retrieval method (i.e. SPUDdir). We set the number
of pseudo-relevant documents k = 20 and generate a pseudo-relevance model of 50
terms. We smooth the pseudo-relevance model with the original query model by setting
τ = 0.5. The parameter μ' (and μ in MQLdir) is set to 2000 during ranking and is
set to 0 only during the expansion step. These expansion parameter settings are set
according to the literature [Abdul-Jaleel et al. 2004; Lv and Zhai 2009b; 2010]. We note
that the pseudo-relevance model here does not follow a DCM relevance model (i.e. we
do not treat all relevant documents as being drawn from a DCM relevance model),
but is simply an adaptation of the RM3 model which we refer to as PURM. We only
use short title queries in this experiment as these are the types of queries to which query
expansion is typically applied [Carpineto and Romano 2012].
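The expansion procedure can be sketched in Python as follows. The sketch estimates an
expansion model as a likelihood-weighted mixture of the top-k document models and then
interpolates it with the original query model, in the spirit of Eqs. (21) and (22). The
function names, the dictionary-based model representation, and the renormalisation of the
truncated expansion model are assumptions made for illustration; document likelihoods are
assumed to be given as plain (non-log) values.

```python
from typing import Dict, Sequence

Model = Dict[str, float]  # term -> probability (assumed representation)


def relevance_model(doc_models: Sequence[Model],
                    doc_likelihoods: Sequence[float]) -> Model:
    """Mix document models, weighting each by its normalised query likelihood."""
    z = sum(doc_likelihoods)
    p_qe: Model = {}
    for model, weight in zip(doc_models, doc_likelihoods):
        for term, p in model.items():
            p_qe[term] = p_qe.get(term, 0.0) + p * weight / z
    return p_qe


def interpolate_query_model(p_q: Model, p_qe: Model,
                            query_weight: float = 0.5,
                            n_terms: int = 50) -> Model:
    """Keep the n_terms strongest expansion terms and smooth with the query."""
    top = dict(sorted(p_qe.items(), key=lambda kv: kv[1], reverse=True)[:n_terms])
    z = sum(top.values())
    if z == 0.0:
        return dict(p_q)  # nothing to expand with
    top = {t: p / z for t, p in top.items()}  # renormalise after truncation
    terms = set(p_q) | set(top)
    return {t: query_weight * p_q.get(t, 0.0)
               + (1.0 - query_weight) * top.get(t, 0.0)
            for t in terms}
```

Swapping the document models and likelihoods passed to `relevance_model` is all that is
needed to move between a multinomial-based and a SPUD-based expansion model in this sketch.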


6.6.1. Pseudo-Relevance Feedback Results. Table XIV shows the results of the
pseudo-relevance feedback experiment. Firstly, we can see that when the SPUDdir approach
is used as the retrieval method with the RM3 expansion approach, it leads to a
significant improvement over the MQL approach. This is encouraging, but hardly surprising,
as the SPUDdir approach has a more effective initial retrieval. However, when the
retrieval method is static, and only the expansion approach is allowed to vary, the PURM
approach outperforms the RM3 approach. The absolute increase in effectiveness when
using the new PURM expansion approach is quite low, but nevertheless is significant
on trec-8 and gov2. This low increase in effectiveness is to be expected, as the only
difference between the RM3 expansion approach and the PURM approach (when μ and
μ' are set to 0) is that the PURM approach uses the SPUD retrieval score to weight
terms, while the RM3 approach uses the MQL retrieval score. Overall, while this
validates that the SPUDdir document retrieval score is useful in the expansion step of
pseudo-relevance expansion approaches, the main increase in effectiveness comes from
the better ranking of SPUDdir compared to MQLdir.

Footnote: Essentially, the PURM expansion model with μ' = 0 only differs from RM3 with
μ = 0 in the fact that the document retrieval score used to weight the expansion term is
different. Therefore, we would expect only a small difference in effectiveness.

A point worth noting is that the performance of the feedback approaches on the
mq-07 and mq-08 test collections is worse than for the initial retrieval run (no
expansion). It has been reported that the benefit of pseudo-relevance feedback varies
depending on the type and quality of the test collection, with results showing little or no
improvement when using parts of the million query track data (i.e. mq-07 and mq-08)
[Meij 2010]. One possible reason for this is that during the creation of the mq-07 and
mq-08 test collections a shallow pool depth was used in order to judge more queries than
is usual for TREC collections. As pseudo-relevance feedback tends to increase average
precision by increasing recall, the lower number of judged documents for the million query
track collections could affect the natural behaviour of query expansion approaches on
these collections.


Table XIV. MAP of pseudo-relevance feedback approaches of SPUDdir-PURM,
SPUDdir-RM3, and MQLdir-RM3 (A means two-sided t-test p < 0.01 compared to
MQLdir-RM3, while A means p < 0.05 compared to MQLdir-RM3. • means two-sided
t-test p < 0.01 compared to SPUDdir-RM3, o means p < 0.05 compared
to SPUDdir-RM3.)

Methods                short queries
Ranking   Expansion    robust-04  trec-8   trec-9/10  gov2     mq-07   mq-08
MQLdir    None         0.232      0.308    0.191      0.303    0.428   0.440
MQLdir    RM3          0.258      0.322    0.212      0.308    0.395   0.417
SPUDdir   RM3          0.265a     0.338A   0.218a     0.319    0.404   0.428
SPUDdir   PURM         0.266a     0.340a•  0.220A     0.324Ao  0.408a  0.429a


7. DISCUSSION 

In this section we discuss the main findings, limitations, and the broader impact of 
this work. 


7.1. Comparison With Previous Work 

The results of experiments in Section 6.3.1 have shown that the SPUDdir method
significantly outperforms the DCM-L-T of Xu and Akella [2008]. Notably, the
effectiveness of DCM-L-T for longer queries, which was not presented in the original work,
is particularly poor. The manner in which the initial query is used in that
relevance-based model leads to a non-linear query term-frequency aspect. This is likely to
affect the retrieval effectiveness for longer queries, as it has been shown that the query
term-frequency aspect should be close to linear [Robertson and Walker 1994].

There are several other disadvantages to the DCM-L-T method. While the
complexity of most retrieval functions is linear with respect to the number of unique terms
(word types) in common to both query and document, the complexity of the approach
by Xu and Akella [2008] is linear with respect to the sum of the query-term
frequencies (i.e. all instances of query-terms) in the document. This adversely affects
retrieval time. Conversely, the SPUD model outlined in this work is as efficient at query
time as the multinomial language model.

In the DCM-L-T approach, the parameter estimates for both the relevant
and the non-relevant DCM document models do not have closed-form expressions. This
is not of major concern for the estimation of a non-relevant model in a static collection
(which can often be estimated off-line), but is a major disadvantage for the inference of
the relevant model, which must be estimated on-line at query time. In fact, one of the
major difficulties with the previous relevance-based approach is estimating the set of
pseudo-relevant documents needed in order to infer the relevance model. Therefore, a
number of computationally expensive estimation techniques are compared in order to
find parameters that are the most effective in terms of retrieval. However, it was found
that a manual tuning of γ is more effective than any of these estimation techniques.

Footnote: For a dynamically changing collection where new documents are discovered and
indexed frequently, this may become an issue.

ACM Transactions on Information Systems, Vol. 9, No. 4, Article 39, Publication date: March 2010.


7.2. Estimating Free Parameters 

In Section 6.4 we have shown that both SPUD retrieval methods are more robust in
terms of parameter settings than their multinomial counterparts. We have shown that
for the SPUDdir model, the background model is weighted approximately four times
more than the document model, and that this setting (via ω = 0.8) is robust across
different collections. More extensive research would need to be conducted to determine
if this setting is universal. Some prior research into Microblog retrieval suggests that
a smaller parameter value in the multinomial query likelihood model is more
effective on collections that contain smaller documents [Han et al. 2012; Kim et al. 2012].
This is consistent with our results (emphasised by results on the ohsu collection, which
contains short documents) as the estimate of m_c is correlated with document length
(see Table XI). This provides further evidence that our free hyperparameter ω is more
robust than the free parameter in the multinomial model.

Furthermore, although it has been suggested [Zhai and Lafferty 2004] that the
parameter μ in the original multinomial language model may be affected by query length,
we have found that the most effective SPUD retrieval method is robust across queries
of different length. More work would need to be conducted to see if the optimal value
of ω varies according to query length. Recent research [Tsagkias et al. 2011] has
investigated a different generative model for queries, and this would also be an interesting
future direction to explore.

The background model in SPUD is only an efficient approximation to the DCM.
Although it has been shown [Elkan 2006] that this EDCM approximation is quite
accurate and is useful for text clustering, more extensive work
would need to be conducted to determine if the approximation is close to optimal in
terms of retrieval effectiveness.


7.3. Theoretical Discussions 

7.3.1. Term and Document Event Spaces. Aspects of both term-frequency and inverse
document frequency have been at the core of many successful ranking functions over
the years. The work outlined here helps to explain why both of these features have
been so useful. In particular, the generative assumptions made in our document model
help explain why term-frequency is such a useful and salient measure of topicality. In
other words, we argue that it is because authors have preferential attachment for the
content words within documents that term-frequency is such a useful measure of
topicality. Furthermore, these generative assumptions lead to power-law characteristics of
term-frequency in text [Simon 1955; Goldwater et al. 2011], and therefore appear to
provide a more plausible model.

Interestingly, it is because of within-document preferential attachment that inverse
document frequency is such an accurate measure of term-specificity. Essentially, when
analysing the collection-wide characteristics of terms, for the most part we need only
count the first occurrence of a term within a document, as all other occurrences depend
upon this. While we did not derive idf as it appeared in its original form [Sparck-Jones
1972], our analysis shows that the best retrieval formula derived from the SPUD
language model contains characteristics closely related to that of idf (see the
corresponding figure). By capturing burstiness in our framework we have been able to
successfully combine the term event space used within each document with the document
event space used at the collection-wide level (which comes about as a close approximation
to the background DCM). Others [Robertson 2004; Roelleke and Wang 2008] have argued that
Harter's eliteness hypothesis [Harter 1975a; 1975b], which is essentially a binary latent
variable for each term, acts as a bridge between the term space and the document space. We
have found that there are alternative generative explanations for tf-idf type schemes. We
believe that the SPUD language model is an important step towards developing a
probabilistic generative theory explaining such schemes.


7.3.2. Relevance. We note that our retrieval model is a query-likelihood model which 
does not explicitly model relevance; however it is not difficult to place the same doc¬ 
ument model in a relevance framework. The KL (Kullback-Leibler) divergence, which 
measures the amount of information lost when one distribution is used to model 
another theoretical distribution, has been used in information retrieval to compare 
document models to query models. As this introduces the idea of a query model, it 
seems reasonable to imagine that this query model is a best initial approximation 
of the true relevance model (which can be updated as relevance information becomes 
known). Therefore, one can think about ranking documents according to the negative 
KL-divergence of a document model Md and a true relevance model Mr as follows: 


    KL(M_r \| M_d) = -\sum_{t \in V} p(t|M_r) \cdot \log \frac{p(t|M_d)}{p(t|M_r)}    (23)


It is also well-known that the query-likelihood function is rank equivalent to the
KL-divergence between a query model and document model as a special case [Zhai and
Lafferty 2001b; Zhai 2008]. The above equation is rank equivalent to the SPUD retrieval
functions when p(t|M_r) is estimated using c(t,q)/|q| and when p(t|M_d) is estimated
using the new document models presented in this article (i.e. α_dm).
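The rank equivalence can be checked numerically with a small Python sketch. With the
query model estimated as c(t,q)/|q|, the negative KL-divergence equals the query
log-likelihood divided by |q| plus the entropy of the query model, and the latter is
constant across documents. The function names and the dictionary representation of the
document model are hypothetical, and the document model is assumed to assign non-zero
probability to every query term.

```python
import math
from typing import Dict


def query_log_likelihood(query_counts: Dict[str, int],
                         p_doc: Dict[str, float]) -> float:
    """log p(q|M_d): sum over query terms of c(t,q) * log p(t|M_d)."""
    return sum(c * math.log(p_doc[t]) for t, c in query_counts.items())


def negative_kl(query_counts: Dict[str, int],
                p_doc: Dict[str, float]) -> float:
    """-KL(M_r || M_d) with p(t|M_r) = c(t,q) / |q|.

    Terms outside the query have p(t|M_r) = 0 and contribute nothing.
    """
    q_len = sum(query_counts.values())
    total = 0.0
    for t, c in query_counts.items():
        p_r = c / q_len
        total -= p_r * math.log(p_r / p_doc[t])
    return total
```

Because the two scores differ only by a positive scaling and a document-independent
constant, sorting documents by either score yields the same ranking.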

7.3.3. Document Length Normalisation. In Section 5.2.1 we defined a constraint to cap¬ 
ture the verbosity hypothesis. We have shown that the best performing SPUD retrieval 
method adheres to this constraint. We have seen that in general the multinomial 
model over-penalises long documents and the SPUD^ir model is more likely 

to retrieve longer documents (See Fig. B. This is because the multinomial model does 
not model the distinction between wora-types and word-tokens, and ultimately over- 
penalises documents with recurrences of non-query terms. Thi s result builds on recent 
research | Lv and Zhai |201H Cummins and O’Riordanj|2012| that developed further 
constraints regarding documient length normalisation. It would be interesting future 
research to determine if the SPUD^ir- function adheres to these constraints also. 

The SPUD model significantly outperforms a highly tuned multinomial model 
(MQL) for all query lengths. This is because the SPUD model incorporates two types of 
document length normalisation. One aspect of normalisation (verbosity) regulates the 
term-frequency with respect to the document length as longer documents (those with 
many word tokens) are more likely to contain higher term-frequencies. The other
aspect (scope) normalises longer documents (those with more word types) as they are
more likely to contain more distinct query-terms. This second aspect of normalisation
is crucially dependent on query length. The SPUD model is the first model to combine
these two aspects of document normalisation in a theoretically principled framework. 
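
To make the distinction between the two length statistics concrete, the following is a minimal sketch that only counts word tokens and word types for two toy documents; it does not implement the SPUD normalisation itself. The helper `length_statistics` and the example texts are illustrative assumptions.

```python
from collections import Counter


def length_statistics(tokens):
    """Return the token length, the type length, and the largest term-frequency."""
    counts = Counter(tokens)
    return {
        "tokens": sum(counts.values()),  # document length in word tokens
        "types": len(counts),            # document length in word types
        "max_tf": max(counts.values()),  # highest term-frequency in the document
    }


# Two hypothetical documents with the same number of tokens.
doc_a = "the urn model the urn model the urn".split()              # repetitive (verbose)
doc_b = "the urn model captures word burstiness in text".split()   # broad (wide scope)

stats_a = length_statistics(doc_a)
stats_b = length_statistics(doc_b)
print(stats_a)
print(stats_b)
```

Both documents have the same token length, so a normalisation based on tokens alone treats them identically, even though the first repeats a few terms (higher term-frequencies) and the second covers more distinct terms (more chances to match distinct query terms).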

Interestingly, recent research has developed a two-stage document length normalisation
framework [Na 2015] which incorporates both verbosity and scope normalisation
into retrieval methods. It is appealing that the SPUD retrieval methods derived from
our probabilistic framework contain these aspects of normalisation naturally.

7.4. Broader Impact

While we have argued that the new SPUD model addresses a number of theoretically
interesting questions in IR, we have demonstrated that it is also practically useful in
a retrieval scenario. Given that the SPUD model is essentially a method for determining
principled term-weights for document vectors, the model is likely to be useful
in other areas where term-weights are used in vector representations of longer texts. 
This includes areas such as text classification, text clustering, and more specialised 
NLP tasks (e.g. keyword extraction, automatic summarisation). 


7.5. Recommended Retrieval Function 

The recommended retrieval function is SPUD$_{dir}$ in Eq. (17). This function has one free
parameter, which we recommend setting to $4 \cdot m_c$, where $m_c$ can be found by applying
Newton's method to Eq. (19). Alternatively, the parameter can be experimentally tuned on
training data, which is the current method of setting $\mu$ in the multinomial language model.


7.6. Future Directions 

The most effective SPUD method, introduced in Eq. (12), linearly combines the background
DCM model with the document DCM language model. This could be extended
to include more language models. For example, if we had information relating to authorship,
we could estimate an author-specific DCM language model that would explain
textual characteristics specific to an author, as it may be the case that certain
authors are generally more verbose than others. Smoothing this DCM model with both
the document and background models may further improve performance. This may be
particularly useful in areas such as expert search.

The document model outlined in this work models word burstiness in a document
specific manner. Previous work [Kwok 1996] has shown that certain terms are more
bursty than others (i.e. they are more likely to repeat). This suggests that incorporating
a term-specific aspect of burstiness may increase retrieval effectiveness even
further. This could be modelled using a more general urn model where the level of 
reinforcement varies per term. 
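
As an illustration of what such a generalised urn could look like, the following is a minimal sketch of a simulation in which each drawn term is returned to the urn together with a term-specific amount of extra mass. It is an assumption-laden toy, not a model proposed or evaluated in this article: the function `sample_document`, its parameters, and the example masses and reinforcement values are all hypothetical. Setting every reinforcement to zero recovers ordinary multinomial sampling.

```python
import random
from collections import Counter


def sample_document(initial_mass, reinforcement, length, seed=0):
    """Draw `length` terms from an urn with term-specific reinforcement.

    initial_mass:  dict mapping term -> initial mass in the urn (assumed positive)
    reinforcement: dict mapping term -> mass added each time the term is drawn
                   (terms missing from this dict default to 1.0, the standard Polya urn)
    """
    rng = random.Random(seed)
    mass = dict(initial_mass)  # copy so the caller's urn is not modified
    terms = sorted(mass)
    doc = []
    for _ in range(length):
        weights = [mass[t] for t in terms]
        term = rng.choices(terms, weights=weights, k=1)[0]
        doc.append(term)
        mass[term] += reinforcement.get(term, 1.0)  # burstier terms gain more mass
    return doc


# Hypothetical urn: "retrieval" is given stronger reinforcement than "the".
urn = {"the": 5.0, "retrieval": 1.0, "urn": 1.0}
burstiness = {"the": 0.1, "retrieval": 3.0, "urn": 1.0}
doc = sample_document(urn, burstiness, length=50, seed=42)
print(Counter(doc))
```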

A further interesting direction is to consider integrating the document model outlined
here with a model that incorporates the traditional notion of term-dependence.
The details regarding such a combination have not been discussed here but would 
present interesting future research. 


8. CONCLUSION 

We have introduced a new family of language models (namely SPUD) based on a Polya
urn process. We have shown that a query likelihood retrieval method based on this
model is superior to that of the state-of-the-art multinomial language model. Interestingly,
we have shown that the new model can be computed as efficiently as the
multinomial language model. Essentially, this means that the SPUD retrieval method 
can be used in place of the multinomial query likelihood method in many different 
retrieval applications and domains. 

We have outlined a number of intuitions that help to motivate the new model. For 
example, we developed a constraint for the verbosity hypothesis and have shown that 
the most effective SPUD method, the SPUD$_{dir}$ model, adheres to this constraint. Furthermore,
we have shown that the free hyperparameter (i.e. $\omega = 0.8$) in the SPUD$_{dir}$
method is robust across various collections. This essentially reduces the need for
experimental tuning. Given the principled nature of the approach developed, it can be
used in a variety of IR tasks. We have shown that it is useful for downstream retrieval
methods, as we have used it to estimate a pseudo-relevance based model (PURM) that
demonstrates improved retrieval effectiveness on test collections when compared to a
pseudo-relevance model based on the multinomial (RM3).

Future work will look to improve retrieval effectiveness by incorporating multiple 
DCM language models for modelling a document. Furthermore, we aim to investigate 
the query likelihood method using different generative assumptions for the query. In 
this work, we assumed a sampling-with-replacement strategy for query generation. 
However, different sampling strategies, such as those employed by Friedman's urn
[Freedman 1965], might better model query generation.